Can Multimodal Large Language Models Truly Perform Multimodal In-Context Learning?

Chen, Shuo; Han, Zhen; He, Bailan; Liu, Jianzhe; Buckley, Mark; Qin, Yao; Torr, Philip; Tresp, Volker; Gu, Jindong

Shuo Chen, Zhen Han, Bailan He, Jianzhe Liu, Mark Buckley, Yao Qin, Philip Torr, Volker Tresp, Jindong Gu; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 6000-6010

Abstract

Large Language Models (LLMs) with in-context learning (ICL) ability can quickly adapt to a specific context given a few demonstrations (demos). Recently, Multimodal Large Language Models (MLLMs) built upon LLMs have also shown multimodal ICL ability, i.e., responding to queries given a few multimodal demos, including images, queries, and answers. While ICL has been extensively studied on LLMs, its research on MLLMs remains limited. One essential question is whether these MLLMs can truly conduct multimodal ICL, or if only the textual modality is necessary. We investigate this question by examining two primary factors that influence ICL: 1) Demo content, i.e., understanding the influences of demo content in different modalities. 2) Demo selection strategy, i.e., how to select better multimodal demos for improved performance. Experiments revealed that multimodal ICL is predominantly driven by the textual content, whereas the visual information in the demos has little influence. Interestingly, visual content is still necessary and useful for selecting demos to increase performance. Motivated by our analysis, we propose a simple yet effective approach, termed Mixed Modality In-Context Example Selection (MMICES), which considers both visual and language modalities when selecting demos. Extensive experiments are conducted to support our findings and verify the improvement brought by our method.
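
The abstract states only that MMICES considers both visual and language modalities when selecting demos; it does not spell out how the two are combined. The following is a minimal sketch of one plausible reading, assuming a two-stage scheme: pre-filter the demo pool by image-embedding similarity, then re-rank the survivors by text-embedding similarity. The function names, the dictionary keys `image_emb` and `text_emb`, the parameters `k_visual` and `n_demos`, and the toy vectors are all hypothetical and are not taken from the paper.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_demos_mixed(query, pool, k_visual, n_demos):
    """Hypothetical mixed-modality selection: visual pre-filter, textual re-rank.

    `query` and every item of `pool` are dicts holding precomputed
    'image_emb' and 'text_emb' vectors. Returns indices into `pool`,
    most text-similar first.
    """
    # Stage 1 (assumed): keep the k_visual demos whose images are closest to the query image.
    by_visual = sorted(
        range(len(pool)),
        key=lambda i: cosine(query["image_emb"], pool[i]["image_emb"]),
        reverse=True,
    )[:k_visual]
    # Stage 2 (assumed): among those, keep the n_demos whose text is closest to the query text.
    by_text = sorted(
        by_visual,
        key=lambda i: cosine(query["text_emb"], pool[i]["text_emb"]),
        reverse=True,
    )
    return by_text[:n_demos]


# Toy embeddings, made up purely for illustration.
pool = [
    {"image_emb": [1.0, 0.0], "text_emb": [0.0, 1.0]},
    {"image_emb": [0.9, 0.1], "text_emb": [1.0, 0.0]},
    {"image_emb": [0.0, 1.0], "text_emb": [1.0, 0.0]},
    {"image_emb": [0.8, 0.2], "text_emb": [0.7, 0.7]},
]
query = {"image_emb": [1.0, 0.0], "text_emb": [1.0, 0.0]}
selected = select_demos_mixed(query, pool, k_visual=3, n_demos=2)
print(selected)
```

In this toy pool, the third demo matches the query text exactly but is dropped in the first stage because its image is dissimilar, which illustrates how visual content can still matter for selection even when the textual content drives the final ranking.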

Related Material

@InProceedings{Chen_2025_WACV,
    author    = {Chen, Shuo and Han, Zhen and He, Bailan and Liu, Jianzhe and Buckley, Mark and Qin, Yao and Torr, Philip and Tresp, Volker and Gu, Jindong},
    title     = {Can Multimodal Large Language Models Truly Perform Multimodal In-Context Learning?},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {6000-6010}
}