Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts

Özdemir, Övgü; Akagündüz, Erdem

Övgü Özdemir, Erdem Akagündüz; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 1562-1571

Abstract

Visual question answering (VQA) is known as an AI-complete task, as it requires understanding, reasoning, and inferring about both vision and language content. Over the past few years, numerous neural architectures have been suggested for the VQA problem. However, achieving success in zero-shot VQA remains a challenge due to its requirement for advanced generalization and reasoning skills. This study explores the impact of incorporating image captioning as an intermediary process within the VQA pipeline. Specifically, we explore the efficacy of utilizing image captions instead of images and leveraging large language models (LLMs) to establish a zero-shot setting. Since image captioning is the most crucial step in this process, we compare the impact of state-of-the-art image captioning models on VQA performance across various question types in terms of structure and semantics. We propose a straightforward and efficient question-driven image captioning approach within this pipeline to transfer contextual information into the question-answering (QA) model. This method involves extracting keywords from the question, generating a caption for each image-question pair using the keywords, and incorporating the question-driven caption into the LLM prompt. We evaluate the efficacy of using general-purpose and question-driven image captions in the VQA pipeline. Our study highlights the potential of employing image captions and harnessing the capabilities of LLMs to achieve competitive performance on GQA under the zero-shot setting. Our code is available at https://github.com/ovguyo/captions-in-VQA.
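
The three steps named in the abstract (keyword extraction, keyword-guided captioning, and caption-based prompting of an LLM) can be illustrated with a minimal sketch. This is not the authors' implementation: the stopword-based keyword extractor, the prompt wording, and the `captioner` and `llm` callables are all hypothetical placeholders chosen for illustration, and any real captioning model or LLM would need to be plugged in by the reader.

```python
import re
from typing import Callable, List

# Hypothetical, deliberately small stopword list (an assumption for this sketch).
STOPWORDS = {
    "a", "an", "the", "is", "are", "was", "were", "what", "which", "who",
    "where", "how", "why", "do", "does", "did", "of", "in", "on", "at",
    "to", "there", "this", "that", "it", "and", "or", "any",
}


def extract_keywords(question: str) -> List[str]:
    """Keep non-stopword tokens in order of first appearance (naive stand-in)."""
    keywords: List[str] = []
    for token in re.findall(r"[a-z]+", question.lower()):
        if token not in STOPWORDS and token not in keywords:
            keywords.append(token)
    return keywords


def captioning_prompt(keywords: List[str]) -> str:
    """Build an instruction that steers the captioner toward the keywords."""
    if not keywords:
        return "Describe the image."
    return "Describe the image, focusing on: " + ", ".join(keywords) + "."


def qa_prompt(caption: str, question: str) -> str:
    """Build the text-only prompt given to the LLM in place of the image."""
    return f"Caption: {caption}\nQuestion: {question}\nAnswer with a short phrase:"


def answer(
    image: object,
    question: str,
    captioner: Callable[[object, str], str],  # hypothetical: (image, prompt) -> caption
    llm: Callable[[str], str],                # hypothetical: prompt -> answer text
) -> str:
    """Run the pipeline: keywords -> question-driven caption -> LLM answer."""
    keywords = extract_keywords(question)
    caption = captioner(image, captioning_prompt(keywords))
    return llm(qa_prompt(caption, question)).strip()
```

Passing the captioner and the LLM as plain callables keeps the sketch independent of any particular model, so different captioning models can be swapped in and compared, which is the kind of comparison the abstract describes.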

Related Material

[bibtex]

@InProceedings{Ozdemir_2024_CVPR,
    author    = {\"Ozdemir, \"Ovg\"u and Akag\"und\"uz, Erdem},
    title     = {Enhancing Visual Question Answering through Question-Driven Image Captions as Prompts},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {1562-1571}
}