Autonomous Multimodal Reasoning via Implicit Chain-of-Vision

Yiqiao Huang, Qi He, Zhaorun Chen, Haopeng Zhang, Hanchao Yu, Zhuokai Zhao; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops, 2025, pp. 2963-2972

Abstract


While large vision-language models (LVLMs) have made significant progress in multimodal reasoning, they continue to struggle with complex tasks that require multi-step reasoning over different visual cues across reasoning stages. In particular, LVLMs have difficulty focusing on critical image regions, which limits their ability to solve challenging multimodal algorithmic problems. To address this limitation, we propose Implicit Chain-of-Vision (ICoV), a fine-tuning framework that empowers LVLMs to autonomously generate implicit rationales directly from visual inputs, improving reasoning capabilities without external supervision. ICoV adopts a step-by-step, decoupled training-inference framework that allows models to integrate structured logical reasoning with targeted attention to essential visual regions during question answering. Experimental results demonstrate that ICoV significantly enhances LVLM performance on complex multimodal tasks, outperforming both standard fine-tuning methods and existing chain-of-vision (CoV)-based decoding approaches.

Related Material


[bibtex]
@InProceedings{Huang_2025_CVPR,
    author    = {Huang, Yiqiao and He, Qi and Chen, Zhaorun and Zhang, Haopeng and Yu, Hanchao and Zhao, Zhuokai},
    title     = {Autonomous Multimodal Reasoning via Implicit Chain-of-Vision},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {2963-2972}
}