-
[pdf]
[supp]
[bibtex]@InProceedings{Li_2024_CVPR, author = {Li, Zhaojian and Zhao, Bin and Yuan, Yuan}, title = {Cyclic Learning for Binaural Audio Generation and Localization}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {26669-26678} }
Cyclic Learning for Binaural Audio Generation and Localization
Abstract
Binaural audio is obtained by simulating the biological structure of human ears which plays an important role in artificial immersive spaces. A promising approach is to utilize mono audio and corresponding vision to synthesize binaural audio thereby avoiding expensive binaural audio recording. However most existing methods directly use the entire scene as a guide ignoring the correspondence between sounds and sounding objects. In this paper we advocate generating binaural audio using fine-grained raw waveform and object-level visual information as guidance. Specifically we propose a Cyclic Locating-and-UPmixing (CLUP) framework that jointly learns visual sounding object localization and binaural audio generation. Visual sounding object localization establishes the correspondence between specific visual objects and sound modalities which provides object-aware guidance to improve binaural generation performance. Meanwhile the spatial information contained in the generated binaural audio can further improve the performance of sounding object localization. In this case visual sounding object localization and binaural audio generation can achieve cyclic learning and benefit from each other. Experimental results demonstrate that on the FAIR-Play benchmark dataset our method is significantly ahead of the existing baselines in multiple evaluation metrics (STFT\downarrow: 0.787 vs. 0.851 ENV\downarrow: 0.128 vs. 0.134 WAV\downarrow: 5.244 vs. 5.684 SNR\uparrow: 7.546 vs. 7.044).
Related Material