DPA: Dual Prototypes Alignment for Unsupervised Adaptation of Vision-Language Models

Eman Ali, Sathira Silva, Muhammad Haris Khan; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 6083-6093

Abstract


Vision-language models (VLMs), e.g., CLIP, have shown remarkable potential in zero-shot image classification. However, adapting these models to new domains remains challenging, especially in unsupervised settings where labeled data is unavailable. Recent research has proposed pseudo-labeling approaches to adapt CLIP in an unsupervised manner using unlabeled target data. Nonetheless, these methods struggle with noisy pseudo-labels resulting from the misalignment between CLIP's visual and textual representations. This study introduces DPA, an unsupervised domain adaptation method for VLMs. DPA introduces the concept of dual prototypes acting as distinct classifiers, along with a convex combination of their outputs, thereby leading to accurate pseudo-label construction. Next, it ranks pseudo-labels to facilitate robust self-training, particularly during early training. Finally, it addresses visual-textual misalignment by aligning textual prototypes with image prototypes to further improve adaptation performance. Experiments on 13 downstream vision tasks demonstrate that DPA significantly outperforms zero-shot CLIP and state-of-the-art unsupervised adaptation baselines.
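
The pseudo-label construction described above can be pictured with a minimal PyTorch sketch. This is not the authors' code: the names text_protos, image_protos, alpha, and tau are illustrative placeholders for the dual prototypes, the mixing weight of the convex combination, and a softmax temperature, assuming L2-normalized CLIP image embeddings.

import torch
import torch.nn.functional as F

def pseudo_labels(image_feats, text_protos, image_protos, alpha=0.5, tau=0.01):
    # Normalize so that dot products are cosine similarities.
    image_feats = F.normalize(image_feats, dim=-1)
    text_protos = F.normalize(text_protos, dim=-1)
    image_protos = F.normalize(image_protos, dim=-1)

    # Each prototype set acts as a distinct classifier over the classes.
    text_probs = (image_feats @ text_protos.t() / tau).softmax(dim=-1)
    image_probs = (image_feats @ image_protos.t() / tau).softmax(dim=-1)

    # Convex combination of the two classifiers' outputs.
    probs = alpha * text_probs + (1.0 - alpha) * image_probs
    conf, labels = probs.max(dim=-1)
    return labels, conf  # pseudo-labels and confidences for ranking

The returned confidences could then be used to rank pseudo-labels so that self-training initially leans on the most reliable ones, while a separate alignment objective (for instance, pulling each textual prototype toward its corresponding image prototype) would address the visual-textual misalignment the abstract describes.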

Related Material


@InProceedings{Ali_2025_WACV,
    author    = {Ali, Eman and Silva, Sathira and Khan, Muhammad Haris},
    title     = {DPA: Dual Prototypes Alignment for Unsupervised Adaptation of Vision-Language Models},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {6083-6093}
}