Encyclopedic VQA: Visual Questions About Detailed Properties of Fine-Grained Categories

Thomas Mensink, Jasper Uijlings, Lluis Castrejon, Arushi Goel, Felipe Cadar, Howard Zhou, Fei Sha, André Araujo, Vittorio Ferrari; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 3113-3124

Abstract


We propose Encyclopedic-VQA, a large scale visual question answering (VQA) dataset featuring visual questions about detailed properties of fine-grained categories and instances. It contains 221k unique question+answer pairs each matched with (up to) 5 images, resulting in a total of 1M VQA samples. Moreover, our dataset comes with a controlled knowledge base derived from Wikipedia, marking the evidence to support each answer. Empirically, we show that our dataset poses a hard challenge for large vision+language models as they perform poorly on our dataset: PaLI [9] is state-of-the-art on OK-VQA [29], yet it only achieves 13.0% accuracy on our dataset. Moreover, we experimentally show that progress on answering our encyclopedic questions can be achieved by augmenting large models with a mechanism that retrieves relevant information for the knowledge base. An oracle experiment with perfect retrieval achieves 87.0% accuracy on the single-hop portion of our dataset, and an automatic retrieval- augmented prototype yields 48.8%. We believe that our dataset enables future research on retrieval-augmented vision+language models.

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Mensink_2023_ICCV, author = {Mensink, Thomas and Uijlings, Jasper and Castrejon, Lluis and Goel, Arushi and Cadar, Felipe and Zhou, Howard and Sha, Fei and Araujo, Andr\'e and Ferrari, Vittorio}, title = {Encyclopedic VQA: Visual Questions About Detailed Properties of Fine-Grained Categories}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, month = {October}, year = {2023}, pages = {3113-3124} }