VGGSounder: Audio-Visual Evaluations for Foundation Models
Abstract
Designing effective foundation models requires high-quality evaluation datasets. With the emergence of audio-visual foundation models, reliable assessment of their multi-modal understanding is essential. The current gold standard for evaluating audio-visual understanding is the popular classification dataset VGGSound. However, our analysis identifies several critical issues in VGGSound, including incomplete labelling, partially overlapping classes, and misaligned modalities. These flaws lead to distorted evaluations of models' true auditory and visual capabilities. To address these limitations, we introduce VGGSounder, a comprehensively re-annotated, multi-label test set extending VGGSound that is explicitly designed to accurately evaluate audio-visual foundation models. VGGSounder features detailed modality annotations, enabling precise analyses of modality-specific performance and revealing previously unnoticed model limitations. We believe VGGSounder offers a robust and reliable benchmark supporting the future development of audio-visual foundation models.
Related Material
[pdf] [supp] [arXiv] [bibtex]
@InProceedings{Zverev_2025_ICCV,
    author    = {Zverev, Daniil and Wiedemer, Thadd\"aus and Prabhu, Ameya and Bethge, Matthias and Brendel, Wieland and Koepke, A. Sophia},
    title     = {VGGSounder: Audio-Visual Evaluations for Foundation Models},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {1027-1037}
}