@InProceedings{Lu_2026_CVPR,
    author    = {Lu, Weiheng and Yu, An and Li, Jian and Zhang, Zhenfei and Ye, Felix X.-F. and Chang, Ming-Ching},
    title     = {FAVE: A Structured Benchmark for Fine-Grained Audio-Visual Temporal Evaluation in Multimodal LLMs},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {1651-1660}
}
FAVE: A Structured Benchmark for Fine-Grained Audio-Visual Temporal Evaluation in Multimodal LLMs
Abstract
Audio-visual large language models (AVLLMs) have made significant strides in understanding visual and auditory content. However, their ability to capture fine-grained temporal relationships between audio and visual streams remains insufficiently evaluated. To address this, we introduce FAVE (Fine-grained Audio-Visual Temporal Evaluation), a comprehensive benchmark targeting three core dimensions of temporal perception: cross-modal temporal alignment (FAVE-align), event temporal relationship (FAVE-low), and detailed moment captioning (FAVE-high). To construct FAVE, we propose a scalable annotation pipeline that integrates shot boundary detection, automated captioning, and GPT-assisted refinement to produce temporally grounded, high-quality data. Extensive experiments on twelve state-of-the-art multimodal LLMs, both open-source and closed-source, reveal key limitations in multimodal integration, temporal relationship understanding, and timestamp localization, especially on joint audio-visual tasks. These findings highlight the need for better temporal modeling to improve AVLLMs' understanding of real-world video content. FAVE serves as a rigorous testbed for advancing temporally aware multimodal systems and will be publicly released upon acceptance.