Benchmarking Out-of-Distribution Detection in Visual Question Answering

Xiangxi Shi, Stefan Lee; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 5485-5495

Abstract

When faced with an out-of-distribution (OOD) question or image, visual question answering (VQA) systems may provide unreliable answers. If real users or secondary systems rely on these outputs, such failures can range from merely annoying to potentially dangerous. Detecting OOD samples in single-modality settings is well studied; however, limited attention has been paid to vision-and-language settings. In this work, we examine OOD detection in the multimodal VQA task and benchmark a suite of approaches for identifying OOD image-question pairs. In our experiments, we leverage popular VQA datasets to benchmark detection performance across a range of difficulties. We also construct composite datasets to examine the impact of individual modalities and of image-question agreement. Our results show that answer confidence alone is often a poor signal, and that methods based on image-based question generation or on examining model attention can yield significantly better results. We find that detecting ungrounded image-question pairs and small shifts in image distribution remains challenging.
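To make the answer-confidence baseline discussed above concrete, the following is a minimal sketch of scoring image-question pairs by the model's peak answer probability (the standard maximum-softmax-probability heuristic); vqa_model, threshold, and the tensor shapes are hypothetical stand-ins for illustration, not the paper's implementation.

import torch
import torch.nn.functional as F

def msp_ood_score(answer_logits: torch.Tensor) -> torch.Tensor:
    """Maximum-softmax-probability (MSP) baseline: score each
    image-question pair by the model's peak answer confidence.
    Lower confidence -> more likely out-of-distribution."""
    probs = F.softmax(answer_logits, dim=-1)   # (batch, num_answers)
    confidence, _ = probs.max(dim=-1)          # peak answer probability per pair
    return -confidence                         # higher score = more likely OOD

# Usage (vqa_model and threshold are hypothetical):
# logits = vqa_model(image, question)          # (batch, num_answers)
# scores = msp_ood_score(logits)
# flagged = scores > threshold                 # threshold tuned on held-out in-distribution data

Because the answer head is trained only on in-distribution answers, this confidence can be miscalibrated on OOD inputs, which is consistent with the paper's finding that confidence alone is often a poor detection signal.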

Related Material

[pdf] [supp]
[bibtex]
@InProceedings{Shi_2024_WACV,
    author    = {Shi, Xiangxi and Lee, Stefan},
    title     = {Benchmarking Out-of-Distribution Detection in Visual Question Answering},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2024},
    pages     = {5485-5495}
}