@InProceedings{Baherwani_2026_CVPR,
    author    = {Baherwani, Vatsal and Ren, Yixuan and Shrivastava, Abhinav},
    title     = {Timestep-Constrained One-Shot Video Motion Customization},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2026},
    pages     = {4845-4854}
}
Timestep-Constrained One-Shot Video Motion Customization
Abstract
Video motion customization seeks to adapt a pre-trained text-to-video (T2V) model to the motion in reference videos and reproduce that motion with novel appearances. Unlike deterministic frame-wise video editing, motion-customized models capture a motion concept and reinstantiate it with temporal diversity. Yet video diffusion models synthesize motion and appearance jointly through iterative denoising under a global objective, leading to entangled temporal and spatial signals. This issue is especially pronounced in the one-shot setting, where the customized model often memorizes both the reference motion and appearance, causing spatial leakage into the generated videos. In this work, we quantitatively investigate how motion and appearance are factorized across denoising timesteps through the proxy of the trade-off between appearance editing and motion preservation induced by injecting new conditions over specified timestep ranges. Across diverse architectures, we identify a consistent pattern where motion is established in early denoising steps and appearance is refined later, revealing a spatiotemporal boundary in timestep space. Motivated by this characterization, we simplify one-shot motion customization by restricting both training and inference to the motion-dominant timesteps. Our timestep-constrained recipe achieves clean motion transfer without auxiliary debiasing modules or specialized objectives, and can be readily integrated into existing motion customization frameworks regardless of model architecture.
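As a rough illustration of what restricting training and inference to the motion-dominant timesteps could look like, the sketch below samples training timesteps only from the high-noise range and gates the customized weights on the same range at inference. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the 1000-step schedule, the convention that larger timestep indices mean more noise (so early denoising steps have large indices), the boundary value `motion_t_min=500`, and the choice to fall back to the base model outside the range are all hypothetical and not taken from the paper.

```python
import random

# Hypothetical values: neither the schedule length nor the boundary
# comes from the abstract; they are placeholders for illustration.
NUM_TRAIN_TIMESTEPS = 1000
MOTION_T_MIN = 500


def sample_motion_timestep(motion_t_min=MOTION_T_MIN,
                           num_train_timesteps=NUM_TRAIN_TIMESTEPS,
                           rng=random):
    """Draw a training timestep uniformly from [motion_t_min, num_train_timesteps).

    Assumes larger indices are noisier, so this range corresponds to the
    early denoising steps where the abstract reports motion is established.
    """
    if not 0 <= motion_t_min < num_train_timesteps:
        raise ValueError("motion_t_min must lie in [0, num_train_timesteps)")
    return rng.randrange(motion_t_min, num_train_timesteps)


def uses_customized_weights(t, motion_t_min=MOTION_T_MIN):
    """Return True if denoising step t should use the customized model.

    Assumption: outside the motion-dominant range, the unmodified base
    model is used so appearance is not driven by the reference video.
    """
    return t >= motion_t_min
```

With these placeholder values, every training draw lands in 500 to 999, and a sampler running from timestep 999 down to 0 would switch from the customized weights to the base model once it passes timestep 500.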

