Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation

Jingyun Wang, Guoliang Kang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 4102-4112

Abstract


Recent works utilize CLIP to perform the challenging unsupervised semantic segmentation task, where only images without annotations are available. However, we observe that when adopting CLIP for such a pixel-level understanding task, unexpected bias occurs. Previous works do not explicitly model this bias, which largely constrains the segmentation performance. In this paper, we propose to explicitly model and rectify the bias existing in CLIP to facilitate unsupervised semantic segmentation. Specifically, we design a learnable "Reference" prompt to encode the class-preference bias and project the positional embedding of the vision transformer to represent the space-preference bias. Via a simple element-wise subtraction, we rectify the logits of the CLIP classifier. Based on the rectified logits, we generate a segmentation mask via a Gumbel-Softmax operation. Then a contrastive loss between the masked visual feature and the text features of different classes is imposed to facilitate effective bias modeling. To further improve the segmentation, we distill the knowledge from the rectified CLIP to an advanced segmentation architecture by minimizing our designed mask-guided, feature-guided, and text-guided loss terms. Extensive experiments on standard benchmarks demonstrate that our method performs favorably against previous state-of-the-art methods. The implementation is available at https://github.com/dogehhh/ReCLIP.
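
The bias-rectification step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the shapes, the BiasRectifier module, the learnable reference embedding standing in for the "Reference" prompt, and the linear projection of the positional embedding are all assumptions made for the sketch.

    # Minimal sketch of bias rectification + Gumbel-Softmax masking (assumed shapes,
    # hypothetical module names; see the official repo for the real implementation).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class BiasRectifier(nn.Module):
        def __init__(self, embed_dim):
            super().__init__()
            # Class-preference bias: a learnable embedding compared against class texts
            # (standing in for the learnable "Reference" prompt in the paper).
            self.reference = nn.Parameter(torch.zeros(1, embed_dim))
            # Space-preference bias: a projection of the ViT positional embedding.
            self.pos_proj = nn.Linear(embed_dim, 1)

        def forward(self, patch_logits, text_feats, pos_embed):
            # patch_logits: (B, P, C) patch-to-class-text similarity logits from CLIP
            # text_feats:   (C, D) CLIP text embeddings of the class names
            # pos_embed:    (P, D) positional embedding of the vision transformer
            class_bias = self.reference @ text_feats.t()      # (1, C)
            space_bias = self.pos_proj(pos_embed).t()         # (1, P)
            # Element-wise subtraction of both bias terms from the logits.
            return patch_logits - class_bias.unsqueeze(1) - space_bias.unsqueeze(-1)

    def masked_contrastive_loss(patch_feats, text_feats, rectified_logits, tau=0.07):
        # Hard but differentiable per-patch class assignment via Gumbel-Softmax.
        mask = F.gumbel_softmax(rectified_logits, tau=1.0, hard=True, dim=-1)  # (B, P, C)
        # Mask-pooled visual feature per class, contrasted against the class texts.
        class_feats = F.normalize(torch.einsum('bpc,bpd->bcd', mask, patch_feats), dim=-1)
        text_feats = F.normalize(text_feats, dim=-1)
        logits = torch.einsum('bcd,kd->bck', class_feats, text_feats) / tau    # (B, C, C)
        target = torch.arange(text_feats.size(0), device=logits.device)
        return F.cross_entropy(logits.flatten(0, 1), target.repeat(logits.size(0)))

The Gumbel-Softmax is what makes the hard mask usable here: it keeps the per-patch class assignment differentiable, so the contrastive loss between the mask-pooled visual features and the class text features can train the two bias terms end to end.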

Related Material


@InProceedings{Wang_2024_CVPR,
    author    = {Wang, Jingyun and Kang, Guoliang},
    title     = {Learn to Rectify the Bias of CLIP for Unsupervised Semantic Segmentation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {4102-4112}
}