Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition

Juncheng Wang, Chao Xu, Cheng Yu, Lei Shang, Zhe Hu, Shujun Wang, Liefeng Bo; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 3111-3120

Abstract


Video-to-audio generation is essential for synthesizing realistic audio tracks that synchronize effectively with silent videos. Following the perspective of extracting essential signals from videos that can precisely control mature text-to-audio generative diffusion models, this paper presents how to balance the representation of mel-spectrograms between completeness and complexity through a new approach called Mel Quantization-Continuum Decomposition (Mel-QCD). We decompose the mel-spectrogram into three distinct types of signals; by applying quantization or continuity to each, they can be effectively predicted from video by a devised video-to-all (V2X) predictor. The predicted signals are then recomposed and fed into a ControlNet, together with a textual-inversion design, to control the audio generation process. Our proposed Mel-QCD method demonstrates state-of-the-art performance across eight metrics evaluating quality, synchronization, and semantic consistency.
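The abstract does not specify the three signal types or the exact quantization scheme, so the following is only a minimal, hypothetical NumPy sketch of the general idea of a quantization-continuum split: a mel-spectrogram is separated into a continuous envelope, a quantized structure, and a continuous residual that recompose losslessly. The component names and the uniform quantizer here are assumptions for illustration, not the paper's method.

```python
import numpy as np

# Toy mel-spectrogram: 80 mel bins x 256 frames
rng = np.random.default_rng(0)
mel = rng.standard_normal((80, 256))

def decompose_mel(mel, num_levels=16):
    """Split a mel-spectrogram into three additive signals:
    a continuous per-frame envelope, a quantized coarse structure,
    and a continuous fine residual. Purely illustrative; the paper's
    actual three signals and quantization scheme may differ."""
    envelope = mel.mean(axis=0, keepdims=True)              # continuous, slowly varying
    detail = mel - envelope
    lo, hi = detail.min(), detail.max()
    step = (hi - lo) / (num_levels - 1)
    structure = np.round((detail - lo) / step) * step + lo  # quantized onto a uniform grid
    residual = detail - structure                           # continuous remainder
    return envelope, structure, residual

def recompose_mel(envelope, structure, residual):
    """Recompose the three signals into a mel-spectrogram, e.g. before
    using it as a conditioning signal for an audio generator (a ControlNet
    over a text-to-audio diffusion model in the paper's pipeline)."""
    return envelope + structure + residual

env, struct, res = decompose_mel(mel)
rec = recompose_mel(env, struct, res)
print(np.allclose(rec, mel))  # True: this toy decomposition is lossless
```

In the paper's setting, each of the three signals would instead be predicted from the silent video by the V2X predictor before recomposition; the sketch only illustrates why a decomposition with quantized and continuous parts can remain complete (lossless) while making each part simpler to predict.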

Related Material


@InProceedings{Wang_2025_CVPR,
    author    = {Wang, Juncheng and Xu, Chao and Yu, Cheng and Shang, Lei and Hu, Zhe and Wang, Shujun and Bo, Liefeng},
    title     = {Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {3111-3120}
}