The Inter-Intra Modal Measure: A Predictive Lens on Fine-Tuning Outcomes in Vision-Language Models

Laura Niss, Kevin Vogt-Lowell, Theodoros Tsiligkaridis; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 2396-2406

Abstract

The fine-tuning of large vision-language foundation models remains an underexplored area, particularly regarding its impact on learning gains and catastrophic forgetting. Inspired by the significance of modality gaps in contrastive dual-encoders, we introduce the Inter-Intra Modal Measure (IIMM)--a predictive metric that quantifies the relationship between intra-modal image embedding similarity and inter-modal misalignment. Through extensive empirical analysis across four state-of-the-art vision-language models and five fine-tuning techniques, we establish a strong linear relationship: tasks with higher IIMM scores yield greater in-domain performance improvements but suffer from more pronounced out-of-domain degradation, with some parameter-efficient fine-tuning (PEFT) methods exhibiting severe forgetting. Compared to existing transferability measures, the IIMM demonstrates significantly stronger predictive power for accuracy changes post fine-tuning in dual-encoder models. Moreover, we provide a theoretical bound, proving that changes in IIMM are limited by the Wasserstein distance between pre- and post-fine-tuning embedding distributions, ensuring its stability and robustness as a predictive measure. With only a single forward pass of the target data, practitioners can leverage this key insight to evaluate the degree to which a model can be expected to improve following fine-tuning. When combined with prior knowledge of a model's performance across diverse tasks, the IIMM further enhances transferability predictions for novel tasks, offering a lightweight yet effective tool for guiding model adaptation strategies. Our code is provided at https://github.com/mit-ll/IIMM.

Related Material


@InProceedings{Niss_2025_ICCV,
    author    = {Niss, Laura and Vogt-Lowell, Kevin and Tsiligkaridis, Theodoros},
    title     = {The Inter-Intra Modal Measure: A Predictive Lens on Fine-Tuning Outcomes in Vision-Language Models},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {2396-2406}
}