ReCLIP: Refine Contrastive Language Image Pre-Training With Source Free Domain Adaptation

Xuefeng Hu, Ke Zhang, Lu Xia, Albert Chen, Jiajia Luo, Yuyin Sun, Ken Wang, Nan Qiao, Xiao Zeng, Min Sun, Cheng-Hao Kuo, Ram Nevatia; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 2994-3003

Abstract


Large-scale pre-trained vision-language models (VLMs) such as CLIP have demonstrated outstanding performance in zero-shot classification, e.g., achieving 76.3% top-1 accuracy on ImageNet without seeing any example, which offers potential benefits to many tasks that have no labeled data. However, when applying CLIP to a downstream target domain, the presence of visual and text domain gaps and cross-modality misalignment can greatly degrade model performance. To address these challenges, we propose ReCLIP, a novel source-free domain adaptation method for vision-language models that requires neither source data nor labeled target data. ReCLIP first learns a projection space to mitigate misaligned visual-text embeddings and to generate pseudo labels; it then deploys cross-modality self-training with these pseudo labels to update the visual and text encoders, refining the labels and reducing domain gaps and misalignment iteratively. With extensive experiments, we demonstrate that ReCLIP outperforms all baselines by a significant margin and improves the average accuracy of CLIP from 69.83% to 74.94% across 22 image classification benchmarks.
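
The pseudo-labeling idea in the abstract can be illustrated with a minimal sketch. This is an approximation, not the paper's implementation: ReCLIP's actual projection space also removes class-agnostic and redundant components, and its labels are further refined by propagation during self-training. The sketch below assumes L2-normalized CLIP image and text embeddings and simply removes the constant cross-modal gap direction (the difference of the two modality means) before assigning each image the nearest class text embedding; all function and variable names are illustrative.

import numpy as np

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def pseudo_labels_in_projection_space(img_emb, txt_emb):
    """Assign pseudo labels after projecting both modalities into a
    shared space that removes the constant cross-modal offset.

    img_emb: (N, D) image embeddings
    txt_emb: (C, D) class-name text embeddings
    """
    img_emb = l2_normalize(img_emb)
    txt_emb = l2_normalize(txt_emb)
    # Direction of the modality gap: difference of the modality means.
    gap = l2_normalize(img_emb.mean(axis=0) - txt_emb.mean(axis=0))
    # Projection onto the hyperplane orthogonal to the gap direction.
    P = np.eye(img_emb.shape[1]) - np.outer(gap, gap)
    img_p = l2_normalize(img_emb @ P)
    txt_p = l2_normalize(txt_emb @ P)
    # Pseudo label = nearest class embedding in the projected space.
    sims = img_p @ txt_p.T  # (N, C) cosine similarities
    return sims.argmax(axis=1), sims

# Toy usage with random data standing in for CLIP features.
rng = np.random.default_rng(0)
img = rng.normal(size=(8, 512))
txt = rng.normal(size=(10, 512))
labels, _ = pseudo_labels_in_projection_space(img, txt)
print(labels)  # one pseudo label in [0, 10) per image

In the full method, these pseudo labels would then drive the cross-modality self-training loop that updates the visual and text encoders iteratively.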

Related Material


@InProceedings{Hu_2024_WACV,
  author    = {Hu, Xuefeng and Zhang, Ke and Xia, Lu and Chen, Albert and Luo, Jiajia and Sun, Yuyin and Wang, Ken and Qiao, Nan and Zeng, Xiao and Sun, Min and Kuo, Cheng-Hao and Nevatia, Ram},
  title     = {ReCLIP: Refine Contrastive Language Image Pre-Training With Source Free Domain Adaptation},
  booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  month     = {January},
  year      = {2024},
  pages     = {2994-3003}
}