M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models

Wang, Hongyu; Xu, Jiayu; Xie, Senwei; Wang, Ruiping; Li, Jialin; Xie, Zhaojie; Zhang, Bin; Xiong, Chuyan; Chen, Xilin

Hongyu Wang, Jiayu Xu, Senwei Xie, Ruiping Wang, Jialin Li, Zhaojie Xie, Bin Zhang, Chuyan Xiong, Xilin Chen; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2026, pp. 382-392

Abstract

Multilingual capability is a crucial requirement for large multimodal models, which are increasingly deployed across diverse countries and languages. However, most existing benchmarks for multilingual multimodal reasoning fail to effectively distinguish models of different strengths; in fact, even text-only language models without visual capabilities can often achieve high scores. As a result, the comprehensive evaluation of state-of-the-art multilingual multimodal models remains underexplored. In this work, we present M4U, a novel and challenging benchmark designed to evaluate multilingual, multi-discipline multimodal understanding and reasoning. M4U comprises 10k samples spanning 64 disciplines across 16 subfields in Science, Engineering, and Healthcare, covering six languages. Using this benchmark, we conduct extensive evaluations of leading Large Multimodal Models (LMMs) and Large Language Models (LLMs) augmented with external tools. Our results reveal that even the strongest LMMs exhibit pronounced language preferences and struggle with reasoning tasks that require integrating multilingual information across visual and textual modalities. In particular, performance drops markedly when models are prompted with cross-lingual multimodal questions, highlighting significant gaps in current multilingual multimodal reasoning capabilities.

Related Material

[bibtex]

@InProceedings{Wang_2026_WACV,
    author    = {Wang, Hongyu and Xu, Jiayu and Xie, Senwei and Wang, Ruiping and Li, Jialin and Xie, Zhaojie and Zhang, Bin and Xiong, Chuyan and Chen, Xilin},
    title     = {M4U: Evaluating Multilingual Understanding and Reasoning for Large Multimodal Models},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {March},
    year      = {2026},
    pages     = {382--392}
}