ART-V: Auto-Regressive Text-to-Video Generation with Diffusion Models

Weng, Wenming; Feng, Ruoyu; Wang, Yanhui; Dai, Qi; Wang, Chunyu; Yin, Dacheng; Zhao, Zhiyuan; Qiu, Kai; Bao, Jianmin; Yuan, Yuhui; Luo, Chong; Zhang, Yueyi; Xiong, Zhiwei

Wenming Weng, Ruoyu Feng, Yanhui Wang, Qi Dai, Chunyu Wang, Dacheng Yin, Zhiyuan Zhao, Kai Qiu, Jianmin Bao, Yuhui Yuan, Chong Luo, Yueyi Zhang, Zhiwei Xiong; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 7395-7405

Abstract

We present ART-V, an efficient framework for auto-regressive video generation with diffusion models. Unlike existing methods that generate entire videos in one shot, ART-V generates a single frame at a time, conditioned on the previous ones. The framework offers three distinct advantages. First, it only learns simple continual motions between adjacent frames, thereby avoiding the modeling of complex long-range motions that require huge amounts of training data. Second, it preserves the high-fidelity generation ability of pre-trained image diffusion models by making only minimal network modifications. Third, it can generate arbitrarily long videos conditioned on a variety of prompts, such as text, images, or their combinations, making it highly versatile and flexible. To combat the drifting issue common in AR models, we propose a masked diffusion model, which implicitly learns which information can be drawn from reference images rather than from network predictions, in order to reduce the risk of generating inconsistent appearances that cause drifting. Moreover, we further enhance generation coherence by conditioning on the initial frame, which typically contains minimal noise. This is particularly useful for long video generation. When trained for only two weeks on four GPUs, ART-V can already generate videos with natural motions, rich details, and a high level of aesthetic quality. It also enables various appealing applications, e.g., composing a long video from multiple text prompts.
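The frame-by-frame procedure described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_frame_model` is a hypothetical placeholder for the conditional diffusion sampler, `Frame` is a toy stand-in for an image, and the `num_refs` window size is an assumption, since the abstract does not state how many previous frames are used. The sketch only shows the control flow: each new frame is conditioned on the most recent frames and on the initial frame, and the prompt may change from one frame to the next.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # toy stand-in for an image; a real system would use tensors


def toy_frame_model(prompt: str, refs: Sequence[Frame], anchor: Frame) -> Frame:
    """Hypothetical placeholder for a conditional diffusion sampler.

    It ignores the prompt, shifts the latest reference frame slightly,
    and blends the result with the anchor (initial) frame.
    """
    last = refs[-1]
    return [0.9 * (p + 0.1) + 0.1 * a for p, a in zip(last, anchor)]


def generate_video(
    first_frame: Frame,
    prompts: Sequence[str],
    model: Callable[[str, Sequence[Frame], Frame], Frame] = toy_frame_model,
    num_refs: int = 2,  # assumed window size, not taken from the paper
) -> List[Frame]:
    """Generate one frame per prompt, auto-regressively."""
    frames = [first_frame]
    for prompt in prompts:
        refs = frames[-num_refs:]  # most recent frames as references
        anchor = frames[0]         # always also condition on the initial frame
        frames.append(model(prompt, refs, anchor))
    return frames


if __name__ == "__main__":
    # Two prompts in sequence, loosely mimicking multi-prompt composition.
    video = generate_video([0.0, 0.0], ["a dog runs"] * 3 + ["the dog jumps"] * 3)
    print(len(video))
```

Because the loop has no fixed length, the number of frames is bounded only by the number of prompts supplied, which mirrors the arbitrarily-long-video property claimed above.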

Related Material

[bibtex]

@InProceedings{Weng_2024_CVPR,
    author    = {Weng, Wenming and Feng, Ruoyu and Wang, Yanhui and Dai, Qi and Wang, Chunyu and Yin, Dacheng and Zhao, Zhiyuan and Qiu, Kai and Bao, Jianmin and Yuan, Yuhui and Luo, Chong and Zhang, Yueyi and Xiong, Zhiwei},
    title     = {ART-V: Auto-Regressive Text-to-Video Generation with Diffusion Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {7395-7405}
}