Assessing and Learning Alignment of Unimodal Vision and Language Models

Le Zhang, Qian Yang, Aishwarya Agrawal; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 14604-14614

Abstract


How well are unimodal vision and language models aligned? While prior work has explored this question, their assessment methods do not directly translate to practical vision-language tasks. In this paper, we propose a direct assessment method, inspired by linear probing, to evaluate vision-language alignment. We identify that the degree of alignment of SSL vision models depends on their SSL training objective, and we find that the clustering quality of SSL representations has a greater impact on alignment performance than their linear separability. We then introduce Swift Alignment of Image and Language (SAIL), an efficient transfer learning framework that aligns pretrained unimodal vision and language models for downstream tasks. SAIL requires significantly less paired image-text data (~6%) than models like CLIP, which are trained from scratch. It trains on a single A100 GPU in 5 hours and supports a batch size of up to 32,768. SAIL achieves 73.4% zero-shot accuracy on ImageNet (compared to CLIP's 72.7%) and excels in zero-shot retrieval, complex reasoning, and semantic segmentation. SAIL also enhances the language compatibility of vision encoders, improving the performance of multimodal large language models. The full codebase and model weights are open-source: https://lezhang7.github.io/sail.github.io/
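To make the alignment recipe described above concrete, the sketch below shows one way frozen pretrained unimodal encoders could be aligned with a lightweight projection head trained on paired image-text data using a CLIP-style contrastive loss. This is a minimal illustration under those assumptions, written in PyTorch; the names AlignmentHead, contrastive_loss, vision_encoder, and text_encoder are hypothetical and do not reflect the released SAIL implementation.

    # Hypothetical sketch: freeze pretrained unimodal encoders and train only a
    # lightweight projection head on paired image-text data with a CLIP-style
    # contrastive loss. Names are illustrative, not the released SAIL API.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class AlignmentHead(nn.Module):
        """Linear projections mapping frozen unimodal features to a shared space."""

        def __init__(self, vision_dim: int, text_dim: int, embed_dim: int = 512):
            super().__init__()
            self.vision_proj = nn.Linear(vision_dim, embed_dim)
            self.text_proj = nn.Linear(text_dim, embed_dim)
            # Learnable temperature, initialized near log(1/0.07) as in CLIP-style training.
            self.logit_scale = nn.Parameter(torch.tensor(2.659))

        def forward(self, vision_feats: torch.Tensor, text_feats: torch.Tensor):
            v = F.normalize(self.vision_proj(vision_feats), dim=-1)
            t = F.normalize(self.text_proj(text_feats), dim=-1)
            return v, t


    def contrastive_loss(v: torch.Tensor, t: torch.Tensor, logit_scale: torch.Tensor):
        """Symmetric InfoNCE loss over a batch of paired image/text embeddings."""
        logits = logit_scale.exp() * v @ t.t()
        targets = torch.arange(v.size(0), device=v.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))


    # Usage sketch: vision_encoder / text_encoder stand in for any frozen
    # pretrained unimodal models (e.g. an SSL vision backbone and a text encoder).
    # head = AlignmentHead(vision_dim=1024, text_dim=768)
    # with torch.no_grad():                       # encoders stay frozen
    #     v_feats = vision_encoder(images)
    #     t_feats = text_encoder(texts)
    # v, t = head(v_feats, t_feats)
    # loss = contrastive_loss(v, t, head.logit_scale)

Only the projection head and temperature are trained in this sketch, which is what makes the single-GPU, large-batch setup in the abstract plausible: the frozen encoder features can be precomputed and the trainable parameter count stays small.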

Related Material


BibTeX:

@InProceedings{Zhang_2025_CVPR,
  author    = {Zhang, Le and Yang, Qian and Agrawal, Aishwarya},
  title     = {Assessing and Learning Alignment of Unimodal Vision and Language Models},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
  month     = {June},
  year      = {2025},
  pages     = {14604-14614}
}