VisionMath: Vision-Form Mathematical Problem-Solving

Ma, Zongyang; Chen, Yuxin; Zhang, Ziqi; Qi, Zhongang; Yuan, Chunfeng; Zhu, Shaojie; Zhuo, Chengxiang; Li, Bing; Liu, Ye; Li, Zang; Shan, Ying; Hu, Weiming

Zongyang Ma, Yuxin Chen, Ziqi Zhang, Zhongang Qi, Chunfeng Yuan, Shaojie Zhu, Chengxiang Zhuo, Bing Li, Ye Liu, Zang Li, Ying Shan, Weiming Hu; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 1162-1172

Abstract

Mathematical problems in real-world scenarios are often presented in a purely vision-form, where textual problem statement and accompanying math figures, e.g., geometry figures and functional graphs, are integrated into a single image. This vision-form problem-solving task requires precise comprehension and reasoning on both textual and graphical elements in the images, posing significant challenge to current Multimodal Large Language Models (MLLMs), which process text and math figures in isolation. In this work, we propose VisionMath, the first exploration for vision-form mathematical problem-solving model, which employs a three-stage progressive multimodal reasoning alignment strategy to systematically enhance task-specific capabilities. Building upon a LLM proficient in unimodal mathematical reasoning, VisionMath first establishes foundational OCR capabilities through capturing rendered mathematical problem images. Subsequently, the model develops comprehensive understanding of figure structures and properties via learning from figure descriptions and mathematical educational videos. Finally, the model's reasoning capacity is activated using carefully constructed visual-form problem-solving datasets VisionMath-IT with chain-of-thought annotations. For comprehensive evaluation, we construct multilingual benchmarks covering diverse problem types, including geometry, algebra, function problems in both English and Chinese. Our model weights, data and code will be made available at https://github.com/mengqiDyangge/VisionMath.

Related Material

[pdf] [supp]

[bibtex]

@InProceedings{Ma_2025_ICCV, author = {Ma, Zongyang and Chen, Yuxin and Zhang, Ziqi and Qi, Zhongang and Yuan, Chunfeng and Zhu, Shaojie and Zhuo, Chengxiang and Li, Bing and Liu, Ye and Li, Zang and Shan, Ying and Hu, Weiming}, title = {VisionMath: Vision-Form Mathematical Problem-Solving}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, month = {October}, year = {2025}, pages = {1162-1172} }