What Makes Multimodal In-Context Learning Work?

Folco Bertini Baldassini, Mustafa Shukor, Matthieu Cord, Laure Soulier, Benjamin Piwowarski; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 1539-1550

Abstract


Large Language Models have demonstrated remarkable performance across various tasks, exhibiting the capacity to swiftly acquire new skills, for example through In-Context Learning (ICL) with minimal demonstration examples. In this work, we present a comprehensive framework for investigating Multimodal ICL (M-ICL) in the context of Large Multimodal Models. We consider the best open-source multimodal models (e.g., IDEFICS, OpenFlamingo) and a wide range of multimodal tasks. Our study unveils several noteworthy findings: (1) M-ICL primarily relies on text-driven mechanisms, showing little to no influence from the image modality. (2) When used with an advanced ICL strategy (like RICES), M-ICL is not better than a simple strategy based on majority voting over context examples. Moreover, we identify several biases and limitations of M-ICL that warrant consideration prior to deployment. The code will be made publicly available.
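
To make finding (2) concrete, below is a minimal sketch of the two strategies being compared: RICES-style retrieval of demonstrations by image similarity, and a majority vote over the retrieved demonstrations' labels that bypasses the model entirely. It assumes precomputed image embeddings (e.g. from a CLIP-like encoder); the helper names and data layout are illustrative and not taken from the paper's code.

```python
# Illustrative sketch only; not the paper's implementation.
import numpy as np
from collections import Counter


def rices_retrieve(query_emb, demo_embs, k=4):
    """RICES-style selection: pick the k demonstrations whose image
    embeddings are most similar (cosine) to the query image."""
    q = query_emb / np.linalg.norm(query_emb)
    d = demo_embs / np.linalg.norm(demo_embs, axis=1, keepdims=True)
    sims = d @ q
    return np.argsort(-sims)[:k]


def majority_vote_baseline(demo_labels, selected_idx):
    """Baseline from the comparison: ignore the model and predict the
    most frequent label among the retrieved demonstrations."""
    votes = [demo_labels[i] for i in selected_idx]
    return Counter(votes).most_common(1)[0][0]


# Toy usage with random vectors standing in for real image features.
rng = np.random.default_rng(0)
demo_embs = rng.normal(size=(100, 512))                  # 100 demonstration images
demo_labels = rng.choice(["cat", "dog"], size=100).tolist()
query_emb = rng.normal(size=512)

idx = rices_retrieve(query_emb, demo_embs, k=8)
print("retrieved demos:", idx)
print("majority-vote prediction:", majority_vote_baseline(demo_labels, idx))
```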

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Baldassini_2024_CVPR,
    author    = {Baldassini, Folco Bertini and Shukor, Mustafa and Cord, Matthieu and Soulier, Laure and Piwowarski, Benjamin},
    title     = {What Makes Multimodal In-Context Learning Work?},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {1539-1550}
}