Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery

Wei He, Xianghan Meng, Zhiyuan Huang, Xianbiao Qi, Rong Xiao, Chun-Guang Li; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 39637-39646

Abstract


Generalized Category Discovery (GCD) aims to identify both known and unknown categories, with only partial labels given for the known categories, posing a challenging open-set recognition problem. State-of-the-art approaches for GCD are usually built on multi-modality representation learning, which places heavy emphasis on inter-modality alignment rather than intra-modality alignment. In this paper, we propose a novel and effective multi-modal representation learning approach for GCD via Semi-Supervised Rate Reduction, called SSR^2-GCD, which learns cross-modality representations with desired structural properties by properly harnessing intra-modality alignment. Moreover, to boost knowledge transfer, we integrate prompt candidates by leveraging the inter-modal alignment offered by Vision Language Models. We conduct extensive experiments on generic and fine-grained benchmark datasets, demonstrating the superior performance of the proposed approach and verifying the importance of properly harnessing intra-modality alignment.

Related Material


[bibtex]
@InProceedings{He_2026_CVPR,
    author    = {He, Wei and Meng, Xianghan and Huang, Zhiyuan and Qi, Xianbiao and Xiao, Rong and Li, Chun-Guang},
    title     = {Multi-Modal Representation Learning via Semi-Supervised Rate Reduction for Generalized Category Discovery},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {39637-39646}
}