MasQCLIP for Open-Vocabulary Universal Image Segmentation

Xin Xu, Tianyi Xiong, Zheng Ding, Zhuowen Tu; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 887-898


We present a new method for open-vocabulary universal image segmentation that performs instance, semantic, and panoptic segmentation under a unified framework. Our approach, called MasQCLIP, integrates seamlessly with a pre-trained CLIP model by utilizing its dense features, thereby circumventing the need for extensive parameter training. MasQCLIP emphasizes two new aspects when building an image segmentation method with a CLIP model: 1) a student-teacher module that handles masks of the novel (unseen) classes by distilling information from the base (seen) classes; 2) a fine-tuning process that updates the model parameters for the queries Q within the CLIP model. Thanks to these two simple and intuitive designs, MasQCLIP achieves state-of-the-art performance, with substantial gains over competing methods across all three tasks: open-vocabulary instance, semantic, and panoptic segmentation. Project page is at

Related Material

@InProceedings{Xu_2023_ICCV,
    author    = {Xu, Xin and Xiong, Tianyi and Ding, Zheng and Tu, Zhuowen},
    title     = {MasQCLIP for Open-Vocabulary Universal Image Segmentation},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {887-898}
}