Animate and Sound an Image
Abstract
This paper addresses a promising yet underexplored task, Image-to-Sounding-Video (I2SV) generation, which animates a static image and generates synchronized sound simultaneously. Despite advances in video and audio generation models, challenges remain in developing a unified model that generates naturally sounding videos. In this work, we propose a novel approach that leverages two separate pretrained diffusion models and lets vision and audio influence each other during generation, building on the Diffusion Transformer (DiT) architecture. First, the pretrained video and audio generation models are each decomposed into input, output, and expert sub-modules. We propose a unified joint DiT block that integrates the expert sub-modules to effectively model the interaction between the two modalities, resulting in high-quality I2SV generation. Then, we introduce a joint classifier-free guidance technique to boost performance during joint generation. Finally, we conduct extensive experiments on three popular benchmark datasets; in both objective and subjective evaluations, our method surpasses all baseline methods on almost all metrics. Case studies show that our generated sounding videos are high quality and well synchronized between video and audio.
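To illustrate the architecture sketched in the abstract, the following is a minimal sketch, not the authors' released code, of a joint DiT block that keeps separate video and audio expert sub-modules and lets their token streams influence each other through a shared attention step, plus a per-modality classifier-free guidance rule. All module names, dimensions, the concatenation-based fusion scheme, and the independent guidance scales are illustrative assumptions; the abstract does not specify the exact formulation.

# A minimal sketch (assumed, not from the paper) of a joint DiT block and
# per-modality classifier-free guidance for joint video-audio generation.
import torch
import torch.nn as nn


class ExpertBlock(nn.Module):
    # Stand-in for one block of a pretrained single-modality DiT
    # (self-attention followed by an MLP, with pre-layer normalization).
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class JointDiTBlock(nn.Module):
    # Wraps a video expert and an audio expert and adds a shared attention
    # step so the two token streams can interact.
    def __init__(self, video_dim, audio_dim, joint_dim=512, heads=8):
        super().__init__()
        self.video_expert = ExpertBlock(video_dim, heads)
        self.audio_expert = ExpertBlock(audio_dim, heads)
        # Project both streams into a shared space for the joint attention step.
        self.v_in, self.v_out = nn.Linear(video_dim, joint_dim), nn.Linear(joint_dim, video_dim)
        self.a_in, self.a_out = nn.Linear(audio_dim, joint_dim), nn.Linear(joint_dim, audio_dim)
        self.joint_attn = nn.MultiheadAttention(joint_dim, heads, batch_first=True)

    def forward(self, video_tokens, audio_tokens):
        v = self.video_expert(video_tokens)
        a = self.audio_expert(audio_tokens)
        # Attend over the concatenated token streams, then split back per modality.
        joint = torch.cat([self.v_in(v), self.a_in(a)], dim=1)
        mixed, _ = self.joint_attn(joint, joint, joint, need_weights=False)
        v_mix, a_mix = mixed.split([v.shape[1], a.shape[1]], dim=1)
        return v + self.v_out(v_mix), a + self.a_out(a_mix)


def joint_cfg(cond_v, cond_a, uncond_v, uncond_a, scale_v=5.0, scale_a=5.0):
    # Standard classifier-free guidance applied to each modality's noise
    # prediction; independent guidance scales per modality are an assumption.
    guided_v = uncond_v + scale_v * (cond_v - uncond_v)
    guided_a = uncond_a + scale_a * (cond_a - uncond_a)
    return guided_v, guided_a


if __name__ == "__main__":
    block = JointDiTBlock(video_dim=320, audio_dim=256)
    vid = torch.randn(2, 64, 320)  # (batch, video tokens, channel dim)
    aud = torch.randn(2, 32, 256)  # (batch, audio tokens, channel dim)
    vid_out, aud_out = block(vid, aud)
    print(vid_out.shape, aud_out.shape)

In this reading, the pretrained experts and their input/output sub-modules would stay fixed while the projections and joint attention provide the newly learned cross-modal interaction; the paper's actual integration and training strategy may differ.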
Related Material

[pdf] [supp] [bibtex]

@InProceedings{Wang_2025_CVPR,
    author    = {Wang, Xihua and Song, Ruihua and Li, Chongxuan and Cheng, Xin and Li, Boyuan and Wu, Yihan and Wang, Yuyue and Xu, Hongteng and Wang, Yunfeng},
    title     = {Animate and Sound an Image},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {23369-23378}
}