@InProceedings{He_2025_WACV,
    author    = {He, Yuhang and Shin, Sangyun and Cherian, Anoop and Trigoni, Niki and Markham, Andrew},
    title     = {SoundLoc3D: Invisible 3D Sound Source Localization and Classification using a Multimodal RGB-D Acoustic Camera},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {5408-5418}
}
SoundLoc3D: Invisible 3D Sound Source Localization and Classification using a Multimodal RGB-D Acoustic Camera
Abstract
Accurately localizing 3D sound sources and estimating their semantic labels, where the sources may not be visible but are assumed to lie on the physical surface of objects in the scene, has many real applications, including detecting gas leaks and machinery malfunctions. The weak audio-visual correlation in such a setting poses new challenges in deriving innovative methods to answer whether, and how, cross-modal information can be used to solve the task. To this end, we propose to use an acoustic-camera rig consisting of a pinhole RGB-D camera and a coplanar four-channel microphone array (Mic-Array). By using this rig to record audio-visual signals from multiple views, we can use the cross-modal cues to estimate the sound sources' 3D locations. Specifically, our framework SoundLoc3D treats the task as a set prediction problem, in which each element in the set corresponds to a potential sound source. Given the weak audio-visual correlation, the set representation is initially learned from a single-view microphone array signal and then refined by actively incorporating physical surface cues revealed from multiview RGB-D images. We demonstrate the efficiency and superiority of SoundLoc3D on a large-scale simulated dataset and further show its robustness to RGB-D measurement inaccuracy and ambient noise interference.
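The abstract's premise that sources lie on physical surfaces seen by a pinhole RGB-D camera can be illustrated with a minimal sketch. This is not the SoundLoc3D method: it only shows standard pinhole back-projection of a depth map to 3D points, followed by a simple nearest-surface-point snap that stands in for the learned refinement. The function names, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the toy values are all assumptions made for illustration.

```python
import numpy as np


def backproject_depth(depth, fx, fy, cx, cy):
    """Lift a depth map (H, W) to camera-frame 3D points (N, 3).

    Standard pinhole model; the intrinsics are hypothetical inputs.
    Pixels with non-positive depth are treated as invalid and dropped.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    valid = depth > 0
    z = depth[valid]
    x = (u[valid] - cx) * z / fx
    y = (v[valid] - cy) * z / fy
    return np.stack([x, y, z], axis=1)


def snap_to_surface(candidates, surface_points):
    """Move each candidate location (M, 3) to its nearest surface point.

    A geometric stand-in for "refinement with surface cues";
    the paper's refinement is learned, not a nearest-neighbour snap.
    """
    diff = candidates[:, None, :] - surface_points[None, :, :]
    d2 = (diff ** 2).sum(axis=2)
    return surface_points[d2.argmin(axis=1)]


# Toy scene: a flat wall 2 m in front of a 4x4-pixel camera.
depth = np.full((4, 4), 2.0)
points = backproject_depth(depth, fx=2.0, fy=2.0, cx=1.5, cy=1.5)

# A rough audio-only guess that floats slightly in front of the wall.
guess = np.array([[0.4, -0.6, 1.7]])
refined = snap_to_surface(guess, points)
print(refined)
```

With multiple views, the per-view point sets would first be transformed into a common frame using the camera poses before any such surface constraint is applied.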