@InProceedings{Lin_2025_ICCV,
  author    = {Lin, Zongyu and Liu, Wei and Chen, Chen and Lu, Jiasen and Hu, Wenze and Fu, Tsu-Jui and Allardice, Jesse and Lai, Zhengfeng and Song, Liangchen and Zhang, Bowen and Chen, Cha and Fei, Yiran and Li, Lezhi and Yang, Yinfei and Sun, Yizhou and Chang, Kai-Wei},
  title     = {STIV: Scalable Text and Image Conditioned Video Generation},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2025},
  pages     = {16249-16259}
}
STIV: Scalable Text and Image Conditioned Video Generation
Abstract
We present a simple and scalable text- and image-conditioned video generation method. Our approach, named STIV, integrates a variable number of image conditions into a Diffusion Transformer (DiT) through frame replacement. This design enables STIV to perform both text-to-video (T2V) and text-image-to-video (TI2V) tasks simultaneously, as well as long video generation through autoregressive rollouts. Additionally, STIV can be easily extended to various applications, such as video prediction, frame interpolation, and multi-view generation. Through comprehensive ablation studies on T2I, T2V, TI2V, and long video generation, STIV demonstrates strong performance despite its simple design. An 8.7B model at 512^2 resolution achieves 83.1 on the VBench T2V benchmark, surpassing leading open- and closed-source models such as CogVideoX-5B, Pika, Kling, and Gen-3. The same-sized model also achieves a state-of-the-art result of 90.1 on the VBench I2V task at 512^2 resolution. Combining all of these components, we scale our model up to 540p with over 200 frames. By providing a transparent recipe for building cutting-edge video generation models, we aim to empower future research and accelerate progress in video generation.
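The frame-replacement conditioning described above can be sketched as follows. This is an illustrative interpretation, not the authors' implementation: the function name, tensor shapes, and the choice of overwriting noised latent frames with clean condition-image latents are all assumptions made for the example.

```python
import torch

def frame_replacement(noisy_latents, cond_latents, cond_indices):
    """Sketch of frame-replacement image conditioning for a video DiT.

    Assumed behavior: at each diffusion step, the latent frames at the
    conditioning positions are overwritten with the clean (un-noised)
    latents of the condition images, so the model always sees them
    intact and learns to propagate their content to the other frames.

    noisy_latents: (B, T, C, H, W) noised video latents
    cond_latents:  (B, K, C, H, W) clean latents of K condition images
    cond_indices:  K frame positions to overwrite (e.g. [0] for TI2V)
    """
    latents = noisy_latents.clone()
    for k, t in enumerate(cond_indices):
        latents[:, t] = cond_latents[:, k]
    return latents

# TI2V-style example: condition on a clean first frame.
B, T, C, H, W = 2, 16, 4, 32, 32
noisy = torch.randn(B, T, C, H, W)
first_frame = torch.randn(B, 1, C, H, W)
out = frame_replacement(noisy, first_frame, cond_indices=[0])
```

Because the number of overwritten positions is arbitrary, the same mechanism covers T2V (no condition frames), TI2V (first frame), and autoregressive rollout (conditioning each chunk on previously generated frames).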