An Approach to Complex Visual Data Interpretation with Vision-Language Models

Thanh-Son Nguyen, Viet-Tham Huynh, Van-Loc Nguyen, Minh-Triet Tran; Proceedings of the Asian Conference on Computer Vision (ACCV) Workshops, 2024, pp. 334-350

Abstract


The LAVA Workshop 2024 challenge assessed the ability of Large Vision-Language Models (VLMs) to accurately interpret and understand complex visual data, including intricate formats such as data flow diagrams, class diagrams, Gantt charts, and architectural blueprints. In response, our research adapts the MMMU (Massive Multi-discipline Multimodal Understanding) benchmark to better align with the requirements of visual data interpretation. We propose a comprehensive approach that combines advanced prompt engineering techniques with a voting-based ensemble method for aggregating model predictions, improving the model's ability to generalize across different types of visual inputs. Evaluated rigorously within the challenge, our approach achieved a total score of 0.85, securing the top position in the competition. This result demonstrates the effectiveness of combining prompt engineering with simple yet powerful ensemble strategies to enhance the performance of VLMs on complex multimodal tasks.
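
As a minimal sketch of the voting-based ensemble described above, the snippet below aggregates answers from several prompt or model variants by plain majority vote. The function name majority_vote, the tie-breaking policy, and the example answers are illustrative assumptions; the abstract does not specify these details.

from collections import Counter

def majority_vote(predictions):
    """Aggregate answers from several prompt or model variants by majority vote.

    Counter.most_common breaks ties by first appearance, so the earliest-seen
    answer wins a tie; the paper does not state its tie-breaking rule, so this
    policy is an assumption.
    """
    counts = Counter(predictions)
    best_answer, _ = counts.most_common(1)[0]
    return best_answer

# Hypothetical answers from three prompt variants for one multiple-choice question.
answers = ["B", "B", "C"]
print(majority_vote(answers))  # -> B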

Related Material


@InProceedings{Nguyen_2024_ACCV,
    author    = {Nguyen, Thanh-Son and Huynh, Viet-Tham and Nguyen, Van-Loc and Tran, Minh-Triet},
    title     = {An Approach to Complex Visual Data Interpretation with Vision-Language Models},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV) Workshops},
    month     = {December},
    year      = {2024},
    pages     = {334-350}
}