Multi Event Localization by Audio-Visual Fusion With Omnidirectional Camera and Microphone Array

Wenru Zheng, Ryota Yoshihashi, Rei Kawakami, Ikuro Sato, Asako Kanezaki; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2023, pp. 2566-2574

Abstract


Audio-visual fusion is a promising approach for identifying multiple events occurring simultaneously at different locations in the real world. Previous studies on audio-visual event localization (AVE) have been built on datasets with only monaural or stereo audio; thus, it was hard to distinguish the direction of audio when different sounds are heard from multiple locations. In this paper, we develop a multi-event localization method using multi-channel audio and omnidirectional images. To take full advantage of the spatial correlation between the features of the two modalities, our method employs early fusion; specifically, we propose a new fusion method that can retain audio direction and background information in the images. We also created a new dataset of multi-label events containing around 660 omnidirectional videos with multi-channel audio, and used it to demonstrate the effectiveness of the proposed method.
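The early-fusion idea described above can be illustrated with a minimal sketch: directional audio features from a microphone array are projected onto the same spatial grid as the omnidirectional image features, then concatenated channel-wise so that both audio direction and visual background context survive into the fused representation. All shapes, names, and the concatenation scheme below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

# Hypothetical shapes: an equirectangular visual feature map and an
# audio feature map aligned to the same azimuth/elevation grid.
H, W = 8, 16          # spatial grid (illustrative)
C_v, C_a = 32, 8      # visual / audio channel counts (illustrative)

rng = np.random.default_rng(0)
visual_feat = rng.standard_normal((H, W, C_v))
# Directional audio features, e.g. per-direction energy estimated
# from a microphone array, projected onto the same spatial grid.
audio_feat = rng.standard_normal((H, W, C_a))

def early_fuse(visual, audio):
    """Channel-wise concatenation on a shared spatial grid, so the
    fused map keeps both audio direction and image background cues."""
    assert visual.shape[:2] == audio.shape[:2], "grids must align"
    return np.concatenate([visual, audio], axis=-1)

fused = early_fuse(visual_feat, audio_feat)
print(fused.shape)  # (8, 16, 40)
```

A downstream event-localization head could then predict per-cell event labels from this fused map; the key point is that fusion happens before spatial information is pooled away.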

Related Material


[bibtex]
@InProceedings{Zheng_2023_CVPR,
  author    = {Zheng, Wenru and Yoshihashi, Ryota and Kawakami, Rei and Sato, Ikuro and Kanezaki, Asako},
  title     = {Multi Event Localization by Audio-Visual Fusion With Omnidirectional Camera and Microphone Array},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2023},
  pages     = {2566-2574}
}