Enhancing Visual Question Answering with Pre-trained Vision-Language Models: An Ensemble Approach at the LAVA Challenge 2024

Trong-Hieu Nguyen-Mau, Nhu-Binh Nguyen Truc, Nhu-Vinh Hoang, Minh-Triet Tran, Hai-Dang Nguyen; Proceedings of the Asian Conference on Computer Vision (ACCV) Workshops, 2024, pp. 275-286

Abstract


The LAVA challenge presents complex visual question answering (VQA) tasks involving intricate diagrams, each accompanied by multiple-choice questions in English or Japanese. Addressing this challenge, we, team v1olet, explore the capabilities of pre-trained Large Vision-Language Models to interpret and reason over such sophisticated visual data. We utilize models including Qwen2-VL, InternVL2, MiniCPM, and Llama-3.2-Vision-Instruct, employing a structured prompt template designed to standardize response generation and facilitate step-by-step reasoning. To enhance accuracy and robustness, we implement an ensemble method that uses majority voting to combine outputs from different models and configurations. Our experimental results demonstrate that the ensemble approach significantly improves performance, achieving a higher public score on the LAVA challenge dataset than any individual model. Specifically, the ensemble of Qwen2-VL, InternVL2, and Llama-3.2 models attained the highest public score of 82, outperforming the best single model. This study highlights the effectiveness of combining multiple Large Vision-Language Models through ensemble methods and underscores the potential of prompt-based inference in enhancing model reasoning capabilities for complex VQA tasks. The provided code is here.
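The majority-voting ensemble described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name and the per-question answer dictionary are ours, while the model names are those listed in the abstract.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent answer across model outputs.

    `predictions` maps model names to the multiple-choice option each
    model selected for one question. On a tie, Counter preserves
    insertion order, so earlier-listed models take priority -- a simple
    tie-break rule assumed here for illustration.
    """
    counts = Counter(predictions.values())
    best_answer, _ = counts.most_common(1)[0]
    return best_answer

# Hypothetical outputs from three models for a single question:
answers = {
    "Qwen2-VL": "B",
    "InternVL2": "B",
    "Llama-3.2-Vision-Instruct": "C",
}
print(majority_vote(answers))  # -> B
```

In practice, the same voting rule can also combine multiple configurations of a single model (e.g. different prompts or sampling settings), as the abstract suggests.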

Related Material


[bibtex]
@InProceedings{Nguyen-Mau_2024_ACCV,
  author    = {Nguyen-Mau, Trong-Hieu and Truc, Nhu-Binh Nguyen and Hoang, Nhu-Vinh and Tran, Minh-Triet and Nguyen, Hai-Dang},
  title     = {Enhancing Visual Question Answering with Pre-trained Vision-Language Models: An Ensemble Approach at the LAVA Challenge 2024},
  booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV) Workshops},
  month     = {December},
  year      = {2024},
  pages     = {275-286}
}