Customized Condition Controllable Generation for Video Soundtrack

Fan Qi, Kunsheng Ma, Changsheng Xu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025, pp. 23914-23924

Abstract


Recent advances in latent diffusion models (LDMs) have enabled data-driven paradigms for video soundtrack generation, improving multimodal alignment capabilities. However, current two-stage frameworks, which separately optimize audio-visual correspondence and conditional audio synthesis, fundamentally limit joint modeling of dynamic acoustic properties. In this paper, we propose a novel framework for generating video soundtracks that simultaneously produces music and sound effects tailored to the video content. Our method incorporates a Contrastive Visual-Sound-Music pretraining process that maps these modalities into a unified feature space, enhancing the model's ability to capture intricate audio dynamics. We design Spectrum Divergence Masked Attention for the U-Net to differentiate between the unique characteristics of sound effects and music. We utilize Score-guided Noise Iterative Optimization to provide musicians with customizable control during the generation process. Extensive evaluations on the FilmScoreDB and SymMV&HIMV datasets demonstrate that our approach significantly outperforms state-of-the-art baselines in both subjective and objective assessments, highlighting its potential as a robust tool for video soundtrack generation.
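
To make the pretraining idea concrete, below is a minimal sketch of one plausible reading of the Contrastive Visual-Sound-Music pretraining described above: a CLIP-style InfoNCE objective applied pairwise across the three modalities so that matched video, sound-effect, and music embeddings are pulled into a shared feature space. The function names, the symmetric loss form, and the simple sum over pairwise terms are assumptions for illustration, not the authors' implementation.

```python
# Illustrative sketch only: pairwise InfoNCE over video / sound / music embeddings.
# Encoder architectures, temperature, and loss weighting are assumed, not from the paper.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of embeddings (shape: [B, D])."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # [B, B] similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def vsm_contrastive_loss(video_emb: torch.Tensor,
                         sound_emb: torch.Tensor,
                         music_emb: torch.Tensor) -> torch.Tensor:
    """Align the three modalities in one feature space by summing the
    pairwise contrastive terms (an assumed combination)."""
    return (info_nce(video_emb, sound_emb) +
            info_nce(video_emb, music_emb) +
            info_nce(sound_emb, music_emb))
```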

Related Material


@InProceedings{Qi_2025_CVPR,
    author    = {Qi, Fan and Ma, Kunsheng and Xu, Changsheng},
    title     = {Customized Condition Controllable Generation for Video Soundtrack},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {23914-23924}
}