Exploring Visual Multiple-Choice Question Answering with Pre-trained Vision-Language Models
[pdf]
[bibtex]
@InProceedings{Tran_2024_ACCV,
  author    = {Tran, Gia-Nghia and Luu, Duc-Tuan and Thin, Dang-Van},
  title     = {Exploring Visual Multiple-Choice Question Answering with Pre-trained Vision-Language Models},
  booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV) Workshops},
  month     = {December},
  year      = {2024},
  pages     = {319-333}
}
Abstract
Visual question answering is a challenging task in computer vision and natural language processing that involves answering questions about an image using both visual and textual information. The task is even harder for Japanese, which has received far less research attention than English and other widely studied languages. The ACCV Workshop on Large Vision-Language Model Learning and Applications (LAVA) organized a challenge that benchmarks systems on multiple-choice visual question answering in both Japanese and English. In this paper, we present a simple yet effective approach for this LAVA Workshop Challenge. To produce a correct answer, our proposed framework needs to (1) identify entities and understand the visual concepts and underlying spatial relations in the image referred to by the question, and (2) align the multimodal representations of the visual content with the multiple-choice answers to determine the most accurate response. We believe that the size of the vision-language model affects the overall performance of the proposed system.
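The alignment step described above can be illustrated generically: one common way to answer a multiple-choice visual question with a pre-trained vision-language model is to embed the image (together with the question) and each candidate answer in a shared space, then pick the choice with the highest similarity. The sketch below is a minimal illustration of that idea, not the authors' implementation; the hand-made embedding vectors stand in for the outputs of a real model such as CLIP.

```python
import math

def cosine(u, v):
    # Cosine similarity between two embedding vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pick_answer(image_emb, choice_embs):
    # Return the index of the choice whose embedding best aligns
    # with the (image + question) embedding.
    scores = [cosine(image_emb, c) for c in choice_embs]
    return max(range(len(scores)), key=scores.__getitem__)

# Toy example with hypothetical embeddings (a real system would obtain
# these from a pre-trained vision-language model):
image_emb = [0.9, 0.1, 0.0]
choices = [[0.0, 1.0, 0.0],   # choice A
           [0.8, 0.2, 0.1],   # choice B (closest to the image)
           [0.1, 0.0, 1.0]]   # choice C
print(pick_answer(image_emb, choices))  # → 1
```

In practice the choice embeddings would come from the model's text encoder and the image embedding from its vision encoder, with the question either concatenated to each choice or fused into the image representation.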