Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation

Cheng, Shihan; Kulkarni, Nilesh; Hyde, David; Smirnov, Dmitriy

Shihan Cheng, Nilesh Kulkarni, David Hyde, Dmitriy Smirnov; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 14811-14821

Abstract

Fine-tuning large-scale text-to-video diffusion models to add new generative controls, such as those over physical camera parameters (e.g., shutter speed or aperture), typically requires vast, high-fidelity datasets that are difficult to acquire. In this work, we propose a data-efficient fine-tuning strategy that learns these controls from sparse, low-quality synthetic data. We show that fine-tuning on such simple data not only enables the desired controls but also yields results superior to those of models fine-tuned on photorealistic "real" data. Beyond demonstrating these results, we provide a framework that justifies this phenomenon both intuitively and quantitatively.

Related Material

@InProceedings{Cheng_2026_CVPR,
    author    = {Cheng, Shihan and Kulkarni, Nilesh and Hyde, David and Smirnov, Dmitriy},
    title     = {Less is More: Data-Efficient Adaptation for Controllable Text-to-Video Generation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {14811-14821}
}