LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts

Anh-Quan Cao, Maximilian Jaritz, Matthieu Guillaumin, Raoul de Charette, Loris Bazzani; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 5030-5040

Abstract


Large-scale vision-language pre-trained (VLP) models (e.g., CLIP) are renowned for their versatility, as they can be applied to diverse applications in a zero-shot setup. However, when these models are used in specific domains, their performance often falls short due to domain gaps or the under-representation of these domains in the training data. While fine-tuning VLP models on custom datasets with human-annotated labels can address this issue, annotating even a small-scale dataset (e.g., 100k samples) can be an expensive endeavor, often requiring expert annotators if the task is complex. To address these challenges, we propose LatteCLIP, an unsupervised method for fine-tuning CLIP models on classification with known class names in custom domains, without relying on human annotations. Our method leverages Large Multimodal Models (LMMs) to generate expressive textual descriptions for both individual images and groups of images. These descriptions provide additional contextual information to guide the fine-tuning process in the custom domains. Since LMM-generated descriptions are prone to hallucination or missing details, we introduce a novel strategy to distill only the useful information and stabilize the training. Specifically, we learn rich per-class prototype representations from noisy generated texts and dual pseudo-labels. Our experiments on 10 domain-specific datasets show that LatteCLIP outperforms pre-trained zero-shot methods by an average of +4.74 points in top-1 accuracy and other state-of-the-art unsupervised methods by +3.45 points.
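
The abstract describes combining class-name prompts, LMM-generated descriptions, and per-class prototypes into pseudo-labels for unsupervised fine-tuning. The snippet below is a minimal illustrative sketch of that general idea, not the authors' implementation: it assumes `torch` and `open_clip` are available, uses a placeholder `description` string standing in for an LMM-generated caption, and simplifies the dual pseudo-label to an average of class-name and prototype similarities with an EMA prototype update.

```python
# Minimal sketch (not the authors' code): dual pseudo-labels from class-name prompts
# and LMM-generated descriptions, with per-class prototype text embeddings kept as
# an exponential moving average. Class names are assumed known; the description
# string stands in for any LMM captioning output and is hypothetical.
import torch
import torch.nn.functional as F
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms("ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model = model.to(device).eval()

class_names = ["forest", "river", "residential"]  # example class names for a custom domain

def embed_texts(texts):
    """Encode strings into L2-normalised CLIP text embeddings."""
    tokens = tokenizer(texts).to(device)
    with torch.no_grad():
        feats = model.encode_text(tokens)
    return F.normalize(feats, dim=-1)

# Class-name prompts give the standard zero-shot classifier.
classname_embs = embed_texts([f"a photo of a {c}" for c in class_names])

# Per-class prototypes start from the class-name embeddings and are refined
# with embeddings of generated descriptions via an EMA update.
prototypes = classname_embs.clone()
ema_momentum = 0.99

def dual_pseudo_label(image_feat, description):
    """Combine similarity to class-name prompts with similarity to current prototypes."""
    desc_emb = embed_texts([description])[0]      # embedding of the generated description
    sim_names = image_feat @ classname_embs.T     # zero-shot (class-name) similarity
    sim_proto = image_feat @ prototypes.T         # prototype similarity
    probs = F.softmax(100.0 * (sim_names + sim_proto) / 2, dim=-1)
    return probs.argmax().item(), desc_emb

def update_prototype(cls_idx, desc_emb):
    """EMA-refresh the pseudo-labelled class prototype with the description embedding."""
    prototypes[cls_idx] = F.normalize(
        ema_momentum * prototypes[cls_idx] + (1 - ema_momentum) * desc_emb, dim=-1
    )
```

In use, `image_feat` would be a normalised output of `model.encode_image` on a preprocessed image, and the resulting pseudo-labels and prototypes would supervise a fine-tuning loss on the CLIP encoders; those training details are beyond this sketch.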

Related Material


@InProceedings{Cao_2025_WACV,
    author    = {Cao, Anh-Quan and Jaritz, Maximilian and Guillaumin, Matthieu and de Charette, Raoul and Bazzani, Loris},
    title     = {LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {5030-5040}
}