Audio-Visual LLM for Video Understanding

Fangxun Shu, Lei Zhang, Hao Jiang, Cihang Xie; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025, pp. 4246-4255

Abstract


This paper introduces Audio-Visual LLM, a novel Multimodal Large Language Model designed for holistic video understanding through integrated visual and auditory inputs. Our key innovation is a modality-augmented training approach that uses uniquely designed modality-specific tokens to selectively activate the corresponding visual and auditory encoders. This mechanism is pivotal in enabling efficient end-to-end training across diverse video data modalities, encompassing visual-only, audio-only, and combined audio-visual content. Additionally, we introduce a high-quality video instruction dataset, characterized by its robust temporal audio-visual correlations, which equips the model to adeptly handle a wide range of audio-visual tasks, from nuanced audio-visual narratives to intricate reasoning. Extensive experiments demonstrate the model's impressive zero-shot performance on a variety of video understanding tasks, such as question answering, captioning, and complex reasoning, underscoring its potential for holistic video understanding.
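
The abstract only outlines the modality-augmented training mechanism. As a rough illustration of the idea, the sketch below shows one way learned modality-specific tokens might gate the visual and auditory encoders so that visual-only, audio-only, and audio-visual samples can share a single end-to-end training loop. All module names, shapes, and parameters (e.g., the placeholder encoders and d_model) are assumptions made for illustration and are not taken from the paper.

# Minimal sketch (PyTorch) of modality-specific tokens gating the encoders.
# All names and shapes here are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class AudioVisualTokenizer(nn.Module):
    """Runs only the encoders whose modality is present in the sample and
    prepends a learned modality-specific token to each encoded stream."""

    def __init__(self, d_model: int = 768):
        super().__init__()
        # Placeholder encoders; the paper would use pretrained visual/audio backbones.
        self.visual_encoder = nn.Linear(1024, d_model)   # stands in for a ViT-style encoder
        self.audio_encoder = nn.Linear(128, d_model)     # stands in for an audio encoder
        # Learned modality-specific tokens.
        self.visual_token = nn.Parameter(torch.randn(1, 1, d_model))
        self.audio_token = nn.Parameter(torch.randn(1, 1, d_model))

    def forward(self, frames=None, audio=None):
        streams = []
        if frames is not None:                 # visual-only or audio-visual sample
            v = self.visual_encoder(frames)    # (B, T_v, d_model)
            tok = self.visual_token.expand(v.size(0), -1, -1)
            streams.append(torch.cat([tok, v], dim=1))
        if audio is not None:                  # audio-only or audio-visual sample
            a = self.audio_encoder(audio)      # (B, T_a, d_model)
            tok = self.audio_token.expand(a.size(0), -1, -1)
            streams.append(torch.cat([tok, a], dim=1))
        # Concatenated multimodal sequence to be fed to the LLM as prefix tokens.
        return torch.cat(streams, dim=1)


if __name__ == "__main__":
    model = AudioVisualTokenizer()
    frames = torch.randn(2, 8, 1024)   # dummy visual features: batch of 2, 8 frames
    audio = torch.randn(2, 16, 128)    # dummy audio features: batch of 2, 16 segments
    print(model(frames=frames, audio=audio).shape)  # audio-visual: both encoders active
    print(model(frames=frames).shape)               # visual-only: audio encoder skipped
    print(model(audio=audio).shape)                 # audio-only: visual encoder skipped

In this reading, the modality-specific tokens let one model consume heterogeneous training batches without separate per-modality pipelines; samples missing a modality simply never touch that encoder.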

Related Material


[pdf] [arXiv]
[bibtex]
@InProceedings{Shu_2025_ICCV,
    author    = {Shu, Fangxun and Zhang, Lei and Jiang, Hao and Xie, Cihang},
    title     = {Audio-Visual LLM for Video Understanding},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {4246-4255}
}