TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models

Ni, Haomiao; Egger, Bernhard; Lohit, Suhas; Cherian, Anoop; Wang, Ye; Koike-Akino, Toshiaki; Huang, Sharon X.; Marks, Tim K.

Haomiao Ni, Bernhard Egger, Suhas Lohit, Anoop Cherian, Ye Wang, Toshiaki Koike-Akino, Sharon X. Huang, Tim K. Marks; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 9015-9025

Abstract

Text-conditioned image-to-video generation (TI2V) aims to synthesize a realistic video starting from a given image (e.g., a woman's photo) and a text description (e.g., "a woman is drinking water."). Existing TI2V frameworks often require costly training on video-text datasets and specific model designs for text and image conditioning. In this paper, we propose TI2V-Zero, a zero-shot, tuning-free method that empowers a pretrained text-to-video (T2V) diffusion model to be conditioned on a provided image, enabling TI2V generation without any optimization, fine-tuning, or introducing external modules. Our approach leverages a pretrained T2V diffusion foundation model as the generative prior. To guide video generation with the additional image input, we propose a "repeat-and-slide" strategy that modulates the reverse denoising process, allowing the frozen diffusion model to synthesize a video frame-by-frame starting from the provided image. To ensure temporal continuity, we employ a DDPM inversion strategy to initialize Gaussian noise for each newly synthesized frame and a resampling technique to help preserve visual details. We conduct comprehensive experiments on both domain-specific and open-domain datasets, where TI2V-Zero consistently outperforms a recent open-domain TI2V model. Furthermore, we show that TI2V-Zero can seamlessly extend to other tasks, such as video infilling and prediction, when provided with more images. Its autoregressive design also supports long video generation.
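The "repeat-and-slide" idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it only shows the control flow one might expect, in which a conditioning window is filled with copies of the provided image and then shifted by one frame after each newly synthesized frame. The names `repeat_and_slide`, `synthesize_next`, and `window_size` are hypothetical, the frozen T2V diffusion model is replaced by a caller-supplied stub, and the DDPM inversion and resampling steps are not modeled.

```python
from collections import deque
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")


def repeat_and_slide(
    first_frame: Frame,
    num_new_frames: int,
    window_size: int,
    synthesize_next: Callable[[Sequence[Frame]], Frame],
) -> List[Frame]:
    """Generate frames autoregressively from a single starting frame.

    `synthesize_next` is a hypothetical stand-in for one conditioned pass of a
    frozen video model: it receives the current window and returns one frame.
    """
    if window_size < 1:
        raise ValueError("window_size must be at least 1")

    # "Repeat": fill the conditioning window with copies of the given image.
    window = deque([first_frame] * window_size, maxlen=window_size)
    video = [first_frame]

    for _ in range(num_new_frames):
        new_frame = synthesize_next(tuple(window))
        video.append(new_frame)
        # "Slide": appending to a full deque drops the oldest frame.
        window.append(new_frame)

    return video


# Toy usage: frames are integers and the "model" adds 1 to the latest frame.
seen_windows = []


def toy_model(window):
    seen_windows.append(window)  # record what the stub was conditioned on
    return window[-1] + 1


toy_video = repeat_and_slide(0, num_new_frames=3, window_size=4,
                             synthesize_next=toy_model)
```

In the toy run, the window starts as four copies of the first frame and gradually fills with generated frames. Because each step depends only on a fixed-size window, the loop can run for any number of frames, which is the property the abstract's autoregressive design relies on for long videos.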

Related Material


@InProceedings{Ni_2024_CVPR,
    author    = {Ni, Haomiao and Egger, Bernhard and Lohit, Suhas and Cherian, Anoop and Wang, Ye and Koike-Akino, Toshiaki and Huang, Sharon X. and Marks, Tim K.},
    title     = {TI2V-Zero: Zero-Shot Image Conditioning for Text-to-Video Diffusion Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {9015-9025}
}