MuseDance: A Diffusion-based Music-Driven Image Animation System

Zhikang Dong, Weituo Hao, Ju-Chiang Wang, Peng Zhang, Pawel Polak; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026, pp. 3813-3824

Abstract


Image animation is a rapidly developing area in multimodal research, with a focus on generating videos from reference images. While much of the work has emphasized generic video generation guided by text, music-driven dance image animation remains underexplored. In this paper, we introduce MuseDance, an end-to-end model that animates reference images using both music and text inputs. By integrating music as a conditioning modality, MuseDance generates personalized videos that not only adhere to textual descriptions but also synchronize character movements with the rhythm and dynamics of the music. Unlike existing methods, MuseDance eliminates the need for explicit motion guidance, such as pose sequences or depth maps, reducing the complexity of video generation while enhancing accessibility and flexibility. To support further research in this field, we present a new multimodal dataset comprising 3,122 dance videos, each paired with its background music and a text description. Our approach leverages diffusion-based methods to achieve robust generalization, precise control, and temporal consistency, setting a new benchmark for the task of music-driven image animation. The dataset is available at https://github.com/Dongzhikang/musedance.

Related Material


[pdf]
[bibtex]
@InProceedings{Dong_2026_WACV,
    author    = {Dong, Zhikang and Hao, Weituo and Wang, Ju-Chiang and Zhang, Peng and Polak, Pawel},
    title     = {MuseDance: A Diffusion-based Music-Driven Image Animation System},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {March},
    year      = {2026},
    pages     = {3813-3824}
}