Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition

Juncheng Wang, Chao Xu, Cheng Yu, Lei Shang, Zhe Hu, Shujun Wang, Liefeng Bo; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 3111-3120

Abstract


Video-to-audio generation is essential for synthesizing realistic audio tracks that synchronize effectively with silent videos. Following the perspective of extracting essential signals from videos that can precisely control mature text-to-audio generative diffusion models, this paper presents how to balance the representation of mel-spectrograms between completeness and complexity through a new approach called Mel Quantization-Continuum Decomposition (Mel-QCD). We decompose the mel-spectrogram into three distinct types of signals; by applying quantization or continuity to each, they can be effectively predicted from video by a devised video-to-all (V2X) predictor. The predicted signals are then recomposed and fed into a ControlNet, together with a textual-inversion design, to control the audio generation process. Our proposed Mel-QCD method demonstrates state-of-the-art performance across eight metrics evaluating quality, synchronization, and semantic consistency.
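The abstract does not specify the three signal types or the exact quantization scheme, so the following is only a minimal, hypothetical NumPy sketch of the general idea of a quantization-continuum split: a mel-spectrogram is separated into a continuous envelope, a quantized structure, and a continuous residual that recompose losslessly. The component names and the uniform quantizer here are assumptions for illustration, not the paper's method.

```python
import numpy as np

# Toy mel-spectrogram: 80 mel bins x 256 frames
rng = np.random.default_rng(0)
mel = rng.standard_normal((80, 256))

def decompose_mel(mel, num_levels=16):
    """Split a mel-spectrogram into three additive signals:
    a continuous per-frame envelope, a quantized coarse structure,
    and a continuous fine residual. Purely illustrative; the paper's
    actual three signals and quantization scheme may differ."""
    envelope = mel.mean(axis=0, keepdims=True)              # continuous, slowly varying
    detail = mel - envelope
    lo, hi = detail.min(), detail.max()
    step = (hi - lo) / (num_levels - 1)
    structure = np.round((detail - lo) / step) * step + lo  # quantized onto a uniform grid
    residual = detail - structure                           # continuous remainder
    return envelope, structure, residual

def recompose_mel(envelope, structure, residual):
    """Recompose the three signals into a mel-spectrogram, e.g. before
    using it as a conditioning signal for an audio generator (a ControlNet
    over a text-to-audio diffusion model in the paper's pipeline)."""
    return envelope + structure + residual

env, struct, res = decompose_mel(mel)
rec = recompose_mel(env, struct, res)
print(np.allclose(rec, mel))  # True: this toy decomposition is lossless
```

In the paper's setting, each of the three signals would instead be predicted from the silent video by the V2X predictor before recomposition; the sketch only illustrates why a decomposition with quantized and continuous parts can remain complete (lossless) while making each part simpler to predict.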

Related Material


@InProceedings{Wang_2025_CVPR,
    author    = {Wang, Juncheng and Xu, Chao and Yu, Cheng and Shang, Lei and Hu, Zhe and Wang, Shujun and Bo, Liefeng},
    title     = {Synchronized Video-to-Audio Generation via Mel Quantization-Continuum Decomposition},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {3111-3120}
}