DIEM: Decomposition-Integration Enhancing Multimodal Insights

Jiang, Xinyi; Wang, Guoming; Guo, Junhao; Li, Juncheng; Zhang, Wenqiao; Lu, Rongxing; Tang, Siliang

Xinyi Jiang, Guoming Wang, Junhao Guo, Juncheng Li, Wenqiao Zhang, Rongxing Lu, Siliang Tang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 27304-27313

Abstract

In image question answering, precisely matching and integrating information from text and images is challenging because that information is abundant and sometimes redundant. In this paper, we propose Decomposition-Integration Enhancing Multimodal Insights (DIEM), which first decomposes the given question and image into multiple subquestions and several sub-images, aiming to isolate specific elements for more focused analysis. We then integrate these sub-elements by matching each subquestion with its relevant sub-images, while also retaining the original image, to construct a comprehensive answer to the original question without losing sight of the overall context. This strategy mirrors the human cognitive process of simplifying a complex problem into smaller components for individual analysis, followed by an integration of the resulting insights. We implement DIEM on the LLaVA-v1.5 model and evaluate its performance on ScienceQA and MM-Vet. Experimental results indicate that our method boosts accuracy in most question classes of ScienceQA (+2.03% on average), especially in the image modality (+3.40%). On MM-Vet, our method raises the score from 31.1 to 32.4. These findings highlight DIEM's effectiveness in harmonizing the complexities of multimodal data, demonstrating its ability to enhance accuracy and depth in image question answering through its decomposition-integration process.
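The decompose, match, and integrate flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `SubImage`, `decompose_question`, `match`, and `answer_with_diem_sketch` are hypothetical, the question splitter and keyword matcher are simple placeholders, and `vqa` stands in for any image question answering model (for example, a wrapper around LLaVA-v1.5) that takes a prompt and a list of images.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class SubImage:
    """A cropped region of the original image (hypothetical structure)."""
    label: str                       # assumed textual tag for the region
    box: Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels


def decompose_question(question: str) -> List[str]:
    # Placeholder decomposition: split on " and ".
    # A real system would more likely prompt a language model here.
    parts = [p.strip(" ?") for p in question.split(" and ")]
    return [p + "?" for p in parts if p]


def match(subquestion: str, sub_images: List[SubImage]) -> List[SubImage]:
    # Placeholder matching: keep sub-images whose label occurs in the subquestion.
    words = set(subquestion.lower().strip("?").split())
    return [s for s in sub_images if s.label.lower() in words]


def answer_with_diem_sketch(
    question: str,
    image: object,
    sub_images: List[SubImage],
    vqa: Callable[[str, list], str],
) -> str:
    partial = []
    for sq in decompose_question(question):
        relevant = match(sq, sub_images)
        # Each subquestion sees the original image plus its matched sub-images.
        partial.append((sq, vqa(sq, [image] + relevant)))
    # Integration: answer the original question with the sub-answers as
    # context, keeping the original image so global context is not lost.
    context = " ".join(f"{q} {a}" for q, a in partial)
    return vqa(f"{context} {question}", [image])
```

With a stub `vqa` function, a question such as "What color is the cat and where is the dog?" yields two subquestion calls, each receiving the original image and one matched sub-image, followed by one integration call on the original image alone.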

Related Material

[bibtex]

@InProceedings{Jiang_2024_CVPR,
    author    = {Jiang, Xinyi and Wang, Guoming and Guo, Junhao and Li, Juncheng and Zhang, Wenqiao and Lu, Rongxing and Tang, Siliang},
    title     = {DIEM: Decomposition-Integration Enhancing Multimodal Insights},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {27304-27313}
}