Visual Robustness Benchmark for Visual Question Answering (VQA)

Farhan Ishmam, Ishmam Tashdeed, Talukder Asir Saadat, Hamjajul Ashmafee, Abu Raihan Mostofa Kamal, Azam Hossain; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 6623-6633

Abstract

Can Visual Question Answering (VQA) systems maintain their performance when deployed in the real world? Or are they susceptible to realistic corruption effects, e.g., image blur, which can be detrimental in sensitive applications such as medical VQA? While linguistic robustness has been thoroughly explored in the VQA literature, there has yet to be any significant work on visual robustness. In this work, we present the first large-scale benchmark to evaluate the visual robustness of VQA models, including multimodal large language models and zero-shot evaluation, and assess the strength of realistic corruption effects. Additionally, we have designed several robustness evaluation metrics, each quantifying a distinct aspect of robustness. These sub-metrics can be aggregated into a unified metric and tailored to fit a variety of use cases. Our experiments reveal important insights into the relationships between model size, accuracy, and robustness against visual corruptions. The benchmark highlights the need for a balanced approach in model development that considers model performance without compromising robustness.
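The core evaluation idea described above — comparing a model's answer accuracy on clean versus corrupted images — can be sketched as follows. This is a minimal illustrative sketch: the box-blur corruption, the function names, and the "fraction of clean accuracy retained" ratio are assumptions for demonstration, not the paper's actual corruption suite or metric definitions.

```python
# Illustrative sketch of visual-robustness evaluation for a VQA model:
# corrupt the images, re-run the model, and compare accuracies.
# All names and the robustness ratio are illustrative assumptions.

def box_blur(image, radius=1):
    """Apply a simple box blur to a 2D grayscale image (list of lists)."""
    h, w = len(image), len(image[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Average over the neighborhood, clipped at the image borders.
            vals = [
                image[ny][nx]
                for ny in range(max(0, y - radius), min(h, y + radius + 1))
                for nx in range(max(0, x - radius), min(w, x + radius + 1))
            ]
            out[y][x] = sum(vals) / len(vals)
    return out

def accuracy(predictions, gold_answers):
    """Exact-match accuracy over a list of predicted answers."""
    return sum(p == g for p, g in zip(predictions, gold_answers)) / len(gold_answers)

def relative_robustness(clean_acc, corrupted_acc):
    """Fraction of clean accuracy retained under corruption (1.0 = fully robust)."""
    return corrupted_acc / clean_acc if clean_acc else 0.0
```

In practice one would sweep several corruption types and severities and aggregate the per-corruption scores, but the ratio above already captures the key point of the abstract: a model can have high clean accuracy yet retain little of it under corruption.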

Related Material


@InProceedings{Ishmam_2025_WACV,
  author    = {Ishmam, Farhan and Tashdeed, Ishmam and Saadat, Talukder Asir and Ashmafee, Hamjajul and Kamal, Abu Raihan Mostofa and Hossain, Azam},
  title     = {Visual Robustness Benchmark for Visual Question Answering (VQA)},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
  month     = {February},
  year      = {2025},
  pages     = {6623-6633}
}