Exploiting Visual Context Semantics for Sound Source Localization

Xinchi Zhou, Dongzhan Zhou, Di Hu, Hang Zhou, Wanli Ouyang; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 5199-5208

Abstract


Self-supervised sound source localization in unconstrained visual scenes is an important task in audio-visual learning. In this paper, we propose a visual reasoning module that explicitly exploits rich visual context semantics, alleviating the insufficient use of visual information in previous works. The learning objectives are carefully designed to provide stronger supervision signals for the extracted visual semantics while enhancing audio-visual interactions, which leads to more robust feature representations. Extensive experimental results demonstrate that our approach significantly boosts localization performance on various datasets, even without initializations pretrained on ImageNet. Moreover, by exploiting visual context, our framework can perform both audio-visual and purely visual inference, which expands the application scope of the sound source localization task and further raises the competitiveness of our approach.

Related Material


@InProceedings{Zhou_2023_WACV,
    author    = {Zhou, Xinchi and Zhou, Dongzhan and Hu, Di and Zhou, Hang and Ouyang, Wanli},
    title     = {Exploiting Visual Context Semantics for Sound Source Localization},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2023},
    pages     = {5199-5208}
}