Autonomous Multimodal Reasoning via Implicit Chain-of-Vision

Yiqiao Huang, Qi He, Zhaorun Chen, Haopeng Zhang, Hanchao Yu, Zhuokai Zhao; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops, 2025, pp. 2963-2972

Abstract


While large vision-language models (LVLMs) have made significant progress in multimodal reasoning, they continue to struggle with complex tasks that require multi-step reasoning over different visual cues across reasoning stages. In particular, LVLMs have difficulty focusing on critical image regions, which limits their ability to solve challenging multimodal algorithmic problems. To address this limitation, we propose Implicit Chain-of-Vision (ICoV), a fine-tuning framework that empowers LVLMs to autonomously generate implicit rationales directly from visual inputs, improving reasoning capabilities without external supervision. ICoV adopts a step-by-step, decoupled training-inference framework that allows models to integrate structured logical reasoning with targeted attention to essential visual regions during question answering. Experimental results demonstrate that ICoV significantly enhances LVLM performance on complex multimodal tasks, outperforming both standard fine-tuning methods and existing chain-of-vision (CoV)-based decoding approaches.

Related Material


[bibtex]
@InProceedings{Huang_2025_CVPR,
    author    = {Huang, Yiqiao and He, Qi and Chen, Zhaorun and Zhang, Haopeng and Yu, Hanchao and Zhao, Zhuokai},
    title     = {Autonomous Multimodal Reasoning via Implicit Chain-of-Vision},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {2963-2972}
}