Building Vision-Language Models on Solid Foundations with Masked Distillation

Sepehr Sameni, Kushal Kafle, Hao Tan, Simon Jenni; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 14216-14226

Abstract


Recent advancements in Vision-Language Models (VLMs) have marked a significant leap in bridging the gap between computer vision and natural language processing. However, traditional VLMs, trained through contrastive learning on limited and noisy image-text pairs, often lack the spatial and linguistic understanding to generalize well to dense vision tasks or less common languages. Our approach, Solid Foundation CLIP (SF-CLIP), circumvents this issue by implicitly building on the solid visual and language understanding of foundational models trained on vast amounts of unimodal data. SF-CLIP integrates contrastive image-text pretraining with masked knowledge distillation from large foundational text and vision models. This methodology guides our VLM in developing robust text and image representations. As a result, SF-CLIP shows exceptional zero-shot classification accuracy and enhanced image and text retrieval capabilities, setting a new state of the art for ViT-B/16 trained on YFCC15M and CC12M. Moreover, the dense per-patch supervision enhances our zero-shot and linear-probe performance on semantic segmentation tasks. A remarkable aspect of our model is its multilingual proficiency, evidenced by strong retrieval results in multiple languages despite being trained predominantly on English data. We achieve all of these improvements without sacrificing training efficiency, through our selective application of masked distillation and the inheritance of teacher word embeddings.
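The abstract describes a combined objective: CLIP-style contrastive alignment of image and text embeddings, plus masked distillation of per-patch features from frozen unimodal teachers. The paper's exact loss formulation, masking strategy, and weighting are not given here, so the function below is only an illustrative sketch; `sf_clip_loss`, its arguments, and the MSE distillation term are all assumptions, not the authors' stated implementation.

```python
import numpy as np

def sf_clip_loss(img_emb, txt_emb, student_patches, teacher_patches, mask,
                 temperature=0.07, distill_weight=1.0):
    """Illustrative combined objective (assumed form, not the paper's exact loss):
    symmetric InfoNCE over image/text embeddings + MSE between student and
    frozen-teacher patch tokens, computed only at masked positions."""
    # L2-normalize the global embeddings before computing similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=-1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=-1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B) similarity matrix

    def cross_entropy_diag(l):
        # Cross-entropy with matching pairs on the diagonal as targets.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Symmetric contrastive loss (image-to-text and text-to-image).
    contrastive = 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

    # Distillation: match student patch features to the frozen teacher's
    # features only at masked positions (the "selective" application).
    distill = np.mean((student_patches[mask] - teacher_patches[mask]) ** 2)

    return contrastive + distill_weight * distill
```

Applying distillation only at masked positions keeps the extra supervision cheap relative to distilling every token, which is consistent with the abstract's claim that training efficiency is preserved.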

Related Material


[bibtex]
@InProceedings{Sameni_2024_CVPR,
    author    = {Sameni, Sepehr and Kafle, Kushal and Tan, Hao and Jenni, Simon},
    title     = {Building Vision-Language Models on Solid Foundations with Masked Distillation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {14216-14226}
}