MageBench: Bridging Large Multimodal Models to Agents
Abstract
Recent models such as OpenAI's o1 and DeepSeek's R1, which use test-time scaling techniques, have demonstrated remarkable improvements in reasoning capabilities. We anticipate that multimodal models will soon see similar breakthroughs in multimodal reasoning, which will require highly challenging and specialized evaluations. As one of the most important real-world applications of multimodal models, visual agents demand complex and comprehensive capabilities such as spatial planning and vision-in-the-chain reasoning, which are lacking in existing multimodal benchmarks. In this paper, we introduce **MageBench**, a **M**ultimodal reasoning benchmark built upon lightweight **AGE**nt environments that pose significant reasoning challenges and hold substantial practical value. The results show that only a few product-level models perform better than random acting, and all of them remain far below human level. We analyze and summarize their errors and capability gaps in visual planning. Furthermore, we find that rule-based RL can significantly boost visual reasoning capabilities, which highlights that our benchmark could serve as a valuable testing ground for the emerging field of agentic RL research.
Related Material

[pdf] [supp] [arXiv] [bibtex]

@InProceedings{Zhang_2026_WACV,
  author    = {Zhang, Miaosen and Dai, Qi and Yang, Yifan and Bao, Jianmin and Chen, Dongdong and Qiu, Kai and Luo, Chong and Geng, Xin and Guo, Baining},
  title     = {MageBench: Bridging Large Multimodal Models to Agents},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  month     = {March},
  year      = {2026},
  pages     = {1415-1427}
}