Supervising Sound Localization by In-the-wild Egomotion

Anna Min, Ziyang Chen, Hang Zhao, Andrew Owens; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 23936-23946

Abstract

We present a method for learning binaural sound localization from ego-motion in videos. When the camera moves in a video, the direction of sound sources will change along with it. We train an audio model to predict sound directions that are consistent with visual estimates of camera motion, which we obtain using methods from multi-view geometry. This provides a weak but plentiful form of supervision that we combine with traditional binaural cues. To evaluate this idea, we propose a dataset of real-world audio-visual videos with ego-motion. We show that our model can successfully learn from this real-world data, and that it obtains strong performance on sound localization tasks.
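To make the supervision signal concrete, the sketch below shows one way an ego-motion consistency loss of this kind could look in PyTorch. This is an illustrative assumption, not the authors' released code: `audio_net`, `clip_t1`, `clip_t2`, and `R_1to2` are hypothetical names, where `audio_net` is a binaural encoder that outputs a direction vector in the camera frame and `R_1to2` is the relative camera rotation (frame-1 to frame-2 coordinates) estimated from the video with multi-view geometry.

import torch
import torch.nn.functional as F

def egomotion_consistency_loss(audio_net, clip_t1, clip_t2, R_1to2):
    # Hypothetical sketch: predicted source directions (unit vectors in the
    # camera frame) at two nearby times in the same video.
    d1 = F.normalize(audio_net(clip_t1), dim=-1)   # (B, 3) at time t1
    d2 = F.normalize(audio_net(clip_t2), dim=-1)   # (B, 3) at time t2
    # If the source is roughly static in the world, carrying the t1 prediction
    # through the camera's relative rotation should reproduce the t2 prediction.
    d1_in_frame2 = torch.einsum('bij,bj->bi', R_1to2, d1)  # (B, 3)
    # Penalize angular disagreement between the two predictions.
    return (1.0 - F.cosine_similarity(d2, d1_in_frame2, dim=-1)).mean()

In training, a term like this would be combined with losses based on traditional binaural cues, as the abstract describes; the exact formulation used in the paper may differ.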

Related Material

[bibtex]
@InProceedings{Min_2025_CVPR,
    author    = {Min, Anna and Chen, Ziyang and Zhao, Hang and Owens, Andrew},
    title     = {Supervising Sound Localization by In-the-wild Egomotion},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {23936-23946}
}