Sound3DVDet: 3D Sound Source Detection Using Multiview Microphone Array and RGB Images

Yuhang He, Sangyun Shin, Anoop Cherian, Niki Trigoni, Andrew Markham; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 5496-5507


Spatial localization of 3D sound sources is an important problem in many real-world scenarios, especially when the sources may not have any visually distinguishable characteristics; e.g., finding a gas leak, a malfunctioning motor, etc. In this paper, we cast this task in a novel audio-visual setting, by introducing an acoustic-camera rig consisting of a centered pinhole RGB camera and a uniform circular array of four coplanar microphones. Using this setup, we propose Sound3DVDet, a 3D sound source localization Transformer model that takes as input the neural embeddings of the sound signals from the microphones and multiview images (with known poses), and learns to minimize the reprojection error between the sound source locations predicted by the two modalities and the ground truth as the camera moves. When trained to minimize this consistency loss, the model learns an implicit association between the audio heard at the microphones and the 3D spatial location in the RGB image, which is sufficient to localize the sources in 3D from a single RGB view. To evaluate our method, we introduce a new dataset, the Sound3DVDet Dataset, consisting of nearly 6k scenes produced using the SoundSpaces simulator. We conduct extensive experiments on our dataset and show the efficacy of our approach against closely related methods, demonstrating significant improvements in localization accuracy.
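To make the notion of reprojection error across camera poses concrete, the following is a minimal sketch assuming a standard pinhole camera with known world-to-camera poses. It is not the paper's loss, which the abstract does not specify in detail. The function names (`project`, `reprojection_error`) and the parameters `f`, `cx`, `cy` (focal length and principal point in pixels) are hypothetical and chosen for illustration only.

```python
import math

def project(point_world, R, t, f, cx, cy):
    """Project a 3D world point to pixel coordinates with a pinhole camera.

    R is a 3x3 world-to-camera rotation (nested lists), t a 3-vector translation.
    All names here are illustrative assumptions, not taken from the paper.
    """
    # Transform the point into the camera frame: x_cam = R @ x_world + t
    xc = [sum(R[i][j] * point_world[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        raise ValueError("point is behind the camera")
    # Perspective division followed by the intrinsic mapping
    return (f * xc[0] / xc[2] + cx, f * xc[1] / xc[2] + cy)

def reprojection_error(pred, gt, poses, f, cx, cy):
    """Mean pixel distance between the projections of a predicted and a
    ground-truth 3D source location, averaged over all camera poses."""
    total = 0.0
    for R, t in poses:
        u_p, v_p = project(pred, R, t, f, cx, cy)
        u_g, v_g = project(gt, R, t, f, cx, cy)
        total += math.hypot(u_p - u_g, v_p - v_g)
    return total / len(poses)

# Two assumed views: the camera at the origin, then moved 2 m further back.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
poses = [(IDENTITY, [0.0, 0.0, 0.0]), (IDENTITY, [0.0, 0.0, 2.0])]

gt = [0.0, 0.0, 2.0]    # ground-truth source location (metres)
pred = [0.1, 0.0, 2.0]  # prediction offset by 10 cm along x

err = reprojection_error(pred, gt, poses, f=100.0, cx=320.0, cy=240.0)
```

The same 10 cm offset yields a smaller pixel error in the more distant view, which is one reason averaging over several poses constrains the 3D location better than a single view does.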

Related Material

@InProceedings{He_2024_WACV,
    author    = {He, Yuhang and Shin, Sangyun and Cherian, Anoop and Trigoni, Niki and Markham, Andrew},
    title     = {Sound3DVDet: 3D Sound Source Detection Using Multiview Microphone Array and RGB Images},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2024},
    pages     = {5496-5507}
}