Can You Even Tell Left From Right? Presenting a New Challenge for VQA

Sai Raam Venkataraman, Rishi Sridhar Rao, S. Balasubramanian, R. Raghunatha Sarma, Chandra Sekhar Vorugunti; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 4498-4507

Abstract


Visual Question Answering (VQA) needs a means of evaluating the strengths and weaknesses of models. One aspect of such an evaluation is the measurement of compositional generalisation. This relates to the ability of a model to answer well on scenes whose compositions are different from those of scenes in the training dataset. In this work, we present several quantitative measures of compositional separation and find that popular datasets for VQA are not good compositional evaluators. To address this, we present Uncommon Objects in Unseen Configurations (UOUC), a synthetic dataset for VQA. UOUC is at once fairly complex and compositionally well-separated. The object set of UOUC consists of 380 classes taken from 528 characters from the Dungeons and Dragons game. The training dataset of UOUC consists of 200,000 scenes, whereas the test set consists of 30,000 scenes. In order to study compositional generalisation, simple reasoning and memorisation, each scene of UOUC is annotated with up to 10 novel questions. These deal with spatial relationships, hypothetical changes to scenes, counting, comparison, memorisation and memory-based reasoning. In total, UOUC presents over 2 million questions. Our evaluation of recent high-performing models for VQA shows that they exhibit poor compositional generalisation and a comparatively lower ability for simple reasoning. These results suggest that UOUC could lead to advances in research by being a strong benchmark for VQA, especially in the study of compositional generalisation.

Related Material


[bibtex]
@InProceedings{Venkataraman_2024_WACV,
    author    = {Venkataraman, Sai Raam and Rao, Rishi Sridhar and Balasubramanian, S. and Sarma, R. Raghunatha and Vorugunti, Chandra Sekhar},
    title     = {Can You Even Tell Left From Right? Presenting a New Challenge for VQA},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2024},
    pages     = {4498-4507}
}