VQArt-Bench: A semantically rich VQA Benchmark for Art and Cultural Heritage

Andrea Alfarano, Lorenzo Venturoli, Dario Negueruela Del Castillo; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025, pp. 396-406

Abstract


Multimodal Large Language Models (MLLMs) have demonstrated significant capabilities in joint visual and linguistic tasks. However, existing Visual Question Answering (VQA) benchmarks often fail to evaluate deep semantic understanding, particularly in complex domains like visual art analysis. Confined to simple syntactic structures and surface-level attributes, the questions in these benchmarks fail to capture the diversity and depth of human visual inquiry. This limitation incentivizes models to exploit statistical shortcuts rather than engage in visual reasoning. To address this gap, we introduce VQArt-Bench, a new, large-scale VQA benchmark for the cultural heritage domain. This benchmark is constructed using a novel multi-agent pipeline where specialized agents collaborate to generate nuanced, validated, and linguistically diverse questions. The resulting benchmark is structured along relevant visual understanding dimensions that probe a model's ability to interpret symbolic meaning, narratives, and complex visual relationships. Our evaluation of 14 state-of-the-art MLLMs on this benchmark reveals significant limitations in current models, including a surprising weakness in simple counting tasks and a clear performance gap between proprietary and open-source models. Our dataset is available online.

Related Material


@InProceedings{Alfarano_2025_ICCV,
    author    = {Alfarano, Andrea and Venturoli, Lorenzo and Del Castillo, Dario Negueruela},
    title     = {VQArt-Bench: A semantically rich VQA Benchmark for Art and Cultural Heritage},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {396-406}
}