BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models

Fengyuan Shi, Jiaxi Gu, Hang Xu, Songcen Xu, Wei Zhang, Limin Wang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 7393-7402

Abstract


Diffusion models have made tremendous progress in text-driven image and video generation. Now text-to-image foundation models are widely applied to various downstream image synthesis tasks such as controllable image generation and image editing while downstream video synthesis tasks are less explored for several reasons. First it requires huge memory and computation overhead to train a video generation foundation model. Even with video foundation models additional costly training is still required for downstream video synthesis tasks. Second although some works extend image diffusion models into videos in a training-free manner temporal consistency cannot be well preserved. Finally these adaption methods are specifically designed for one task and fail to generalize to different tasks. To mitigate these issues we propose a training-free general-purpose video synthesis framework coined as BIVDiff via bridging specific image diffusion models and general text-to-video foundation diffusion models. Specifically we first use a specific image diffusion model (e.g. ControlNet and Instruct Pix2Pix) for frame-wise video generation then perform Mixed Inversion on the generated video and finally input the inverted latents into the video diffusion models (e.g. VidRD and ZeroScope) for temporal smoothing. This decoupled framework enables flexible image model selection for different purposes with strong task generalization and high efficiency. To validate the effectiveness and general use of BIVDiff we perform a wide range of video synthesis tasks including controllable video generation video editing video inpainting and outpainting.

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Shi_2024_CVPR, author = {Shi, Fengyuan and Gu, Jiaxi and Xu, Hang and Xu, Songcen and Zhang, Wei and Wang, Limin}, title = {BIVDiff: A Training-Free Framework for General-Purpose Video Synthesis via Bridging Image and Video Diffusion Models}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {7393-7402} }