@InProceedings{Gan_2026_CVPR,
    author    = {Gan, Yulu and Zhao, Kaiya Ivy and Poggio, Tomaso and Isola, Phillip},
    title     = {Seeing Helps Reasoning in Language Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings},
    month     = {June},
    year      = {2026},
    pages     = {7080-7090}
}
Seeing Helps Reasoning in Language Models
Abstract
Multimodal language models can process both images and text, yet existing studies often find that naively incorporating visual information fails to improve, and can even degrade, the performance of the language model. As a result, the language model backbone is usually kept fixed in multimodal setups. Nevertheless, vision and language are the two primary ways through which humans perceive and understand the world. A large language model (LLM) trained solely on text lacks direct grounding in the physical world, suggesting that, if integrated properly, visual input should enhance rather than harm its perceptual and representational capacities. However, how to integrate visual information so that it benefits LLMs remains an open challenge. In this paper, we propose Cross-Modal Alignment Regularization (CMAR), a method designed to improve LLMs by aligning their internal representations with those of vision models during training. Specifically, in addition to the standard next-token prediction objective, we introduce an alignment objective: the language model is trained to make its internal representations consistent with those of a pretrained vision model. This is achieved using an extra paired image-text dataset, where the text is fed to the language model and the image to the vision model to obtain language and vision representations. We use popular alignment measures to calculate the alignment score, and the model is encouraged to maximize this score, thereby bringing the internal representations of the language and vision models closer together. Experimental results demonstrate that CMAR consistently improves language models in both pre-training and fine-tuning settings across various model families, downstream tasks, and alignment measures.
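The two-part objective described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's implementation: the abstract does not name the alignment measures, so linear CKA stands in as one commonly used measure, and the function names `linear_cka` and `combined_loss`, the `weight` parameter, and the additive form of the loss are all hypothetical. The sketch uses NumPy to show only how the scalar values combine; actual training would need a differentiable framework.

```python
import numpy as np


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two feature matrices for the same n paired samples.

    x: (n, d_x) language-model features for the captions (assumed shape).
    y: (n, d_y) vision-model features for the paired images (assumed shape).
    Returns a score in [0, 1]; higher means more similar representations.
    """
    # Center each feature dimension over the batch.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


def combined_loss(next_token_loss: float,
                  text_feats: np.ndarray,
                  image_feats: np.ndarray,
                  weight: float = 0.1) -> float:
    """Hypothetical total loss: next-token loss minus a weighted alignment score.

    Subtracting the score means that minimizing the loss maximizes alignment.
    The weight value is an arbitrary placeholder, not a reported setting.
    """
    alignment = linear_cka(text_feats, image_feats)
    return next_token_loss - weight * alignment
```

Under these assumptions, a batch whose language features already match the vision features yields a lower total loss than a batch with unrelated features, which is the direction of pressure the abstract describes.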

