@InProceedings{Jang_2025_CVPR,
  author    = {Jang, Youngrok and Kong, Hyesoo and Kim, Gyeonghun and Lee, Yejin and Choi, Jungkyu and Bae, Kyunghoon},
  title     = {ICT-QA: Question Answering over Multi-modal Contexts including Image, Chart, and Text Modalities},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2025},
  pages     = {138-148}
}
ICT-QA: Question Answering over Multi-modal Contexts including Image, Chart, and Text Modalities
Abstract
Question answering over multi-modal contexts that include image, chart, and text modalities requires a model to be proficient in understanding each individual modality. For some questions, the model must also locate the necessary evidence across multiple modalities and generate answers through cross-modal reasoning. In this paper, we propose the Image and Chart Instruction Tuning (IC-tuning) method to enhance the model's comprehension of each modality. Specifically, we introduce visual-aware chart instruction-following data that describe both the precise numerical values and the visual information in charts. We then train a Large Language Model (LLM) with a model architecture that uses an image-specific encoder and a chart-specific encoder. Our experiments demonstrate that this method achieves state-of-the-art performance on the Chart Summarization and Open-ended Chart Question Answering (OpenCQA) tasks while having minimal impact on image and language benchmark performance. Although the IC-tuned model understands each modality well, it still struggles with question answering over multi-modal contexts because it is trained only on data for understanding each modality in isolation. To address this, we introduce the Question Answering over Image, Chart, and Text (ICT-QA) dataset, designed specifically for question answering in multi-modal contexts. After further training the IC-tuned LLM on the ICT-QA dataset, our evaluations show that ICT-QA significantly improves answer quality for both single-modal questions, which require referencing only one of the available modalities, and cross-modal questions, which require reasoning across multiple modalities.
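The abstract describes an architecture in which modality-specific encoders (one for images, one for charts) feed a shared LLM. The paper does not publish code here, so the following is only a minimal, hypothetical sketch of that routing idea: every function name (`encode_image`, `encode_chart`, `encode_text`, `build_multimodal_context`) and all embedding logic are stand-ins invented for illustration, not the authors' implementation.

```python
# Hypothetical sketch of a dual-encoder front end: each context element is
# routed to an encoder for its modality, and the resulting embeddings are
# assembled into one sequence that a downstream LLM would consume.
# All names and logic here are illustrative stand-ins, not the paper's code.

def encode_image(pixels):
    # Stand-in image encoder: collapses pixels to a fixed-size "embedding".
    return [sum(pixels) / len(pixels)] * 4

def encode_chart(table):
    # Stand-in chart encoder: keeps the precise numeric values from the
    # chart's underlying data, echoing the paper's emphasis on exact numbers.
    return [float(v) for row in table for v in row][:4]

def encode_text(tokens):
    # Stand-in text encoder: one feature per token.
    return [float(len(t)) for t in tokens][:4]

ENCODERS = {"image": encode_image, "chart": encode_chart, "text": encode_text}

def build_multimodal_context(elements):
    """Route each (modality, payload) pair through its modality-specific
    encoder, producing one embedding per context element."""
    return [ENCODERS[modality](payload) for modality, payload in elements]

context = build_multimodal_context([
    ("text", ["What", "is", "the", "trend"]),
    ("chart", [[1.0, 2.0], [3.0, 4.0]]),
    ("image", [0.2, 0.4, 0.6]),
])
```

In a real system each encoder would be a learned network (e.g. a vision transformer for images) and the embeddings would be projected into the LLM's token space; the sketch only shows the modality-routing structure.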