Deciphering Cross-Modal Alignment in Large Vision-Language Models via Modality Integration Rate

Qidong Huang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Jiaqi Wang, Weiming Zhang, Nenghai Yu; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 218-227

Abstract


The early stage of multi-modal pre-training plays a pivotal role in aligning the two modalities of Large Vision-Language Models (LVLMs), yet evaluating its training quality usually requires the costly supervised fine-tuning (SFT) stage to verify downstream benchmark scores. Loss, perplexity, and in-context evaluation results are commonly used pre-training metrics for Large Language Models (LLMs), but we observe that these metrics are less indicative when quantifying pre-trained LVLMs. This lack of proper metrics greatly hinders research on LVLMs in the multi-modal fusion stage, including training data selection, efficient module design, etc. In this paper, we first present the Modality Integration Rate (MIR), an effective, robust, and generalizable metric that indicates the multi-modal alignment quality of LVLMs without SFT. The metric evaluates LVLM pre-training from the perspective of inter-modal distribution distance and is 1) effective in representing fusion quality, showing a positive correlation with benchmark performance after SFT, 2) robust to different training/evaluation data, and 3) generalizable across training configurations and architecture choices. Complementing MIR, we further propose learnable Modality Calibration (MoCa), a lightweight module that narrows the modality gap at each language model layer during training. A series of experiments explores the effectiveness of MIR and MoCa, demonstrating that MIR is highly indicative for training data selection, training strategy scheduling, and model architecture design toward better pre-training results. The code is available at https://github.com/shikiw/Modality-Integration-Rate
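The abstract positions MIR as an inter-modal distribution distance measured inside the language model, and MoCa as a per-layer learnable calibration of that gap. As a rough illustration only (a minimal sketch, not the authors' exact formulation; see the repository above for the real implementation), the following Python fits Gaussians to vision-token and text-token hidden states at each LLM layer and computes a Fréchet-style distance between them; the per-layer text-norm scaling here is an assumption standing in for the paper's text-centric normalization.

# Minimal sketch of a layer-wise vision-text distribution distance.
# Assumptions: features are numpy arrays of per-token hidden states;
# the scaling step is illustrative, not the paper's exact normalization.
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(x, y):
    """2-Wasserstein distance between Gaussians fitted to x and y.

    x: (n_vision_tokens, d) hidden states of image tokens at one layer.
    y: (n_text_tokens, d) hidden states of text tokens at the same layer.
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    covmean = sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):  # sqrtm can return tiny imaginary noise
        covmean = covmean.real
    d2 = np.sum((mu_x - mu_y) ** 2) + np.trace(cov_x + cov_y - 2 * covmean)
    return float(np.sqrt(max(d2, 0.0)))

def modality_gap_per_layer(vision_feats, text_feats):
    """vision_feats, text_feats: lists of (n_tokens, d) arrays, one per layer."""
    gaps = []
    for v, t in zip(vision_feats, text_feats):
        scale = np.linalg.norm(t, axis=1).mean()  # assumed text-anchored scaling
        gaps.append(frechet_distance(v / scale, t / scale))
    return gaps  # lower values suggest better-aligned modalities at that layer

In the same hedged spirit, one way to realize a "learnable modality calibration" at each language model layer is a learnable affine transform applied only to vision-token positions, letting training shift the visual distribution toward the textual one; the module and argument names below are hypothetical, not MoCa's actual design.

# Hypothetical per-layer calibration module in the spirit of MoCa.
import torch
import torch.nn as nn

class ModalityCalibration(nn.Module):
    def __init__(self, hidden_size: int):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(hidden_size))
        self.shift = nn.Parameter(torch.zeros(hidden_size))

    def forward(self, hidden_states, vision_mask):
        # hidden_states: (batch, seq_len, hidden_size)
        # vision_mask: (batch, seq_len) bool, True at image-token positions
        calibrated = hidden_states * self.scale + self.shift
        return torch.where(vision_mask.unsqueeze(-1), calibrated, hidden_states)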

Related Material


BibTeX:
@InProceedings{Huang_2025_ICCV,
    author    = {Huang, Qidong and Dong, Xiaoyi and Zhang, Pan and Zang, Yuhang and Cao, Yuhang and Wang, Jiaqi and Zhang, Weiming and Yu, Nenghai},
    title     = {Deciphering Cross-Modal Alignment in Large Vision-Language Models via Modality Integration Rate},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {218-227}
}