Audio-Visual Semantic Graph Network for Audio-Visual Event Localization

Liang Liu, Shuaiyong Li, Yongqiang Zhu; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 23957-23966

Abstract


Audio-visual event localization (AVEL) aims to identify both the category and temporal boundaries of events that are both audible and visible in unconstrained videos. However, the inherent semantic gap between heterogeneous modalities often leads to semantic inconsistency. In this paper, we propose a novel Audio-Visual Semantic Graph Network (AVSGN) to facilitate cross-modal alignment and cross-temporal interaction. Unlike previous approaches (e.g., audio-guided, visual-guided, or both), we introduce shared semantic textual labels to bridge the semantic gap between audio and visual modalities. Specifically, we present a cross-modal semantic alignment (CMSA) module to explore the complementary relationships across heterogeneous modalities (i.e., visual, audio, and text), promoting the convergence of multimodal distributions into a unified semantic space. Additionally, to sufficiently capture cross-temporal dependencies, we devise a cross-modal graph interaction (CMGI) module that disentangles the complicated interactions across modalities into three complementary subgraphs. Extensive experiments on the AVE dataset comprehensively demonstrate the superiority and effectiveness of the proposed model in both fully- and weakly-supervised AVE settings.
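To make the two named modules concrete, the following is a minimal PyTorch sketch of how a CMSA-style alignment stage and a CMGI-style three-subgraph interaction stage could be wired together. It is not the authors' implementation: the abstract only names the modules, so the feature dimensions, the use of class-label word embeddings as the "shared semantic textual labels", and the choice of multi-head attention as the graph operator over the three subgraphs are all assumptions made here for illustration.

# Hypothetical sketch (not the paper's code). Assumed: audio/visual/text
# feature sizes, cosine-similarity alignment to label-text embeddings (CMSA),
# and attention layers standing in for the three subgraphs (CMGI).
import torch
import torch.nn as nn
import torch.nn.functional as F


class CMSASketch(nn.Module):
    """Cross-modal semantic alignment: project each modality into a shared space."""

    def __init__(self, audio_dim=128, visual_dim=512, text_dim=300, embed_dim=256):
        super().__init__()
        self.proj_a = nn.Linear(audio_dim, embed_dim)
        self.proj_v = nn.Linear(visual_dim, embed_dim)
        self.proj_t = nn.Linear(text_dim, embed_dim)

    def forward(self, a, v, t):
        # a: (B, T, audio_dim), v: (B, T, visual_dim), t: (C, text_dim)
        a, v, t = self.proj_a(a), self.proj_v(v), self.proj_t(t)
        # Cosine similarity of each segment to every class-label embedding:
        # one plausible way to pull both modalities toward shared text semantics.
        a_n, v_n, t_n = (F.normalize(x, dim=-1) for x in (a, v, t))
        sim_a = a_n @ t_n.t()   # (B, T, C)
        sim_v = v_n @ t_n.t()   # (B, T, C)
        return a, v, (sim_a, sim_v)


class CMGISketch(nn.Module):
    """Cross-modal graph interaction: three complementary subgraphs, here
    approximated as intra-audio, intra-visual, and audio-visual attention."""

    def __init__(self, embed_dim=256, heads=4):
        super().__init__()
        self.intra_a = nn.MultiheadAttention(embed_dim, heads, batch_first=True)
        self.intra_v = nn.MultiheadAttention(embed_dim, heads, batch_first=True)
        self.cross_av = nn.MultiheadAttention(embed_dim, heads, batch_first=True)

    def forward(self, a, v):
        a = a + self.intra_a(a, a, a)[0]   # audio temporal subgraph
        v = v + self.intra_v(v, v, v)[0]   # visual temporal subgraph
        av = self.cross_av(v, a, a)[0]     # cross-modal subgraph (visual queries audio)
        return a, v, av


if __name__ == "__main__":
    B, T, C = 2, 10, 28                    # batch, 1-second segments, event classes
    audio = torch.randn(B, T, 128)
    visual = torch.randn(B, T, 512)
    labels = torch.randn(C, 300)           # e.g., word embeddings of class names
    a, v, sims = CMSASketch()(audio, visual, labels)
    outputs = CMGISketch()(a, v)
    print([x.shape for x in outputs])

In this reading, the CMSA similarities would drive an alignment loss against the event labels, while the three CMGI branches would be fused and fed to a segment-level classifier; how the paper actually combines them is not specified in the abstract.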

Related Material


[bibtex]
@InProceedings{Liu_2025_CVPR,
    author    = {Liu, Liang and Li, Shuaiyong and Zhu, Yongqiang},
    title     = {Audio-Visual Semantic Graph Network for Audio-Visual Event Localization},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {23957-23966}
}