Audio-Visual LLM for Video Understanding
Abstract
This paper introduces Audio-Visual LLM, a novel Multimodal Large Language Model designed for holistic video understanding through integrated visual and auditory inputs. Our work introduces a modality-augmented training approach, using specially designed modality-specific tokens to selectively activate the corresponding visual and auditory encoders. This mechanism is pivotal in enabling efficient end-to-end training across diverse video data modalities, encompassing visual-only, audio-only, and combined audio-visual content. Additionally, we introduce a high-quality video instruction dataset, characterized by its robust temporal audio-visual correlations, which enables the model to handle a wide range of audio-visual tasks, from nuanced audio-visual narratives to intricate reasoning. Extensive experiments demonstrate impressive zero-shot performance on various video understanding tasks, such as question answering, captioning, and complex reasoning, underscoring the model's potential for holistic video understanding.
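To illustrate the modality-augmented training idea described in the abstract, the sketch below shows how learnable modality-specific tokens can mark which modalities are present in a sample, so that only the corresponding encoders are activated for visual-only, audio-only, or audio-visual inputs. This is a minimal PyTorch sketch under assumed names and dimensions (the class AudioVisualLLMSketch, the encoder input sizes, and the tiny Transformer stand-in for the LLM are all illustrative), not the authors' implementation.

# Minimal sketch (not the authors' code): modality-specific tokens gate
# which encoders are active for each training sample.
import torch
import torch.nn as nn

class AudioVisualLLMSketch(nn.Module):
    def __init__(self, dim=256, vocab_size=1000):
        super().__init__()
        # Stand-ins for the visual and auditory encoders (assumed feature sizes).
        self.visual_encoder = nn.Linear(512, dim)
        self.audio_encoder = nn.Linear(128, dim)
        # Learnable modality-specific tokens prepended to the LLM input,
        # signalling which modalities are present in the current sample.
        self.visual_token = nn.Parameter(torch.randn(1, 1, dim))
        self.audio_token = nn.Parameter(torch.randn(1, 1, dim))
        # Small Transformer as a stand-in for the language model backbone.
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True),
            num_layers=2,
        )
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, visual=None, audio=None):
        # Selectively activate encoders based on which modalities are provided.
        parts = []
        if visual is not None:  # visual-only or audio-visual sample
            v = self.visual_encoder(visual)                        # (B, Tv, dim)
            parts.append(self.visual_token.expand(v.size(0), -1, -1))
            parts.append(v)
        if audio is not None:   # audio-only or audio-visual sample
            a = self.audio_encoder(audio)                          # (B, Ta, dim)
            parts.append(self.audio_token.expand(a.size(0), -1, -1))
            parts.append(a)
        x = torch.cat(parts, dim=1)
        return self.head(self.llm(x))

# Usage: the same model handles all three kinds of training batches end to end.
model = AudioVisualLLMSketch()
video_feats = torch.randn(2, 8, 512)    # e.g. 8 frame features per clip
audio_feats = torch.randn(2, 16, 128)   # e.g. 16 audio-segment features
logits_av = model(visual=video_feats, audio=audio_feats)  # audio-visual
logits_v = model(visual=video_feats)                      # visual-only
logits_a = model(audio=audio_feats)                       # audio-only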
Related Material

[pdf] [arXiv] [bibtex]

@InProceedings{Shu_2025_ICCV,
  author    = {Shu, Fangxun and Zhang, Lei and Jiang, Hao and Xie, Cihang},
  title     = {Audio-Visual LLM for Video Understanding},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  month     = {October},
  year      = {2025},
  pages     = {4246-4255}
}