Event-Specific Audio-Visual Fusion Layers: A Simple and New Perspective on Video Understanding

Arda Senocak, Junsik Kim, Tae-Hyun Oh, Dingzeyu Li, In So Kweon; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 2237-2247

Abstract


To understand the surrounding world, our brains are continuously inundated with multisensory information and its complex interactions at any given moment. While processing this information may seem effortless for the human brain, building a machine that performs similar tasks is challenging: such complex interactions cannot be handled by a single type of integration and require more sophisticated approaches. In this paper, we propose a simple new method for multisensory integration in video understanding. Unlike previous works that use a single fusion type, we design a multi-head model with individual event-specific layers that deal with different audio-visual relationships, enabling different ways of audio-visual fusion. Experimental results show that our event-specific layers can discover unique properties of the audio-visual relationships in videos, e.g., semantically matched moments and rhythmic events. Moreover, although our network is trained with single labels, our multi-head design inherently outputs additional, semantically meaningful multi-labels for a video. As an application, we demonstrate that the proposed method can expose the extent of event characteristics in popular benchmark datasets.
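
The core architectural idea described above, several event-specific audio-visual fusion heads whose per-head scores are aggregated into a single-label prediction while remaining available as multi-label outputs, can be illustrated with a minimal sketch. The layer names, feature dimensions, the two fusion modes, and the max-over-heads aggregation below are illustrative assumptions for exposition, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class EventSpecificFusionHead(nn.Module):
    """One event-specific fusion head; each head may fuse audio and visual
    features differently. The concrete fusion ops (concat + MLP vs. a gated
    sum) are illustrative guesses, not the paper's layers."""
    def __init__(self, dim, mode="concat"):
        super().__init__()
        self.mode = mode
        if mode == "concat":
            self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        else:  # gated fusion: audio modulates the visual features
            self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, audio, visual):
        if self.mode == "concat":
            return self.fuse(torch.cat([audio, visual], dim=-1))
        return visual * self.gate(audio) + audio

class MultiHeadAVModel(nn.Module):
    """Multi-head model: each event-specific head produces its own class
    scores; here the scores are aggregated with a max over heads for the
    single-label prediction, while the per-head scores can serve as
    additional multi-label outputs."""
    def __init__(self, dim=512, num_classes=10, head_modes=("concat", "gated")):
        super().__init__()
        self.heads = nn.ModuleList(EventSpecificFusionHead(dim, m) for m in head_modes)
        self.classifiers = nn.ModuleList(nn.Linear(dim, num_classes) for _ in head_modes)

    def forward(self, audio_feat, visual_feat):
        per_head_logits = torch.stack(
            [clf(head(audio_feat, visual_feat))
             for head, clf in zip(self.heads, self.classifiers)], dim=1)  # (B, H, C)
        fused_logits, _ = per_head_logits.max(dim=1)  # aggregate heads for the single label
        return fused_logits, per_head_logits

# Usage sketch: random features stand in for pretrained audio/visual embeddings.
model = MultiHeadAVModel()
a, v = torch.randn(4, 512), torch.randn(4, 512)
single_label_logits, multi_head_logits = model(a, v)
```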

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Senocak_2023_WACV,
    author    = {Senocak, Arda and Kim, Junsik and Oh, Tae-Hyun and Li, Dingzeyu and Kweon, In So},
    title     = {Event-Specific Audio-Visual Fusion Layers: A Simple and New Perspective on Video Understanding},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2023},
    pages     = {2237-2247}
}