Multimodal Generalized Category Discovery

Su, Yuchang; Zhou, Renping; Huang, Siyu; Li, Xingjian; Wang, Tianyang; Wang, Ziyue; Xu, Min

Yuchang Su, Renping Zhou, Siyu Huang, Xingjian Li, Tianyang Wang, Ziyue Wang, Min Xu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025, pp. 1634-1643

Abstract

Generalized Category Discovery (GCD) aims to classify inputs into both known and novel categories, a task crucial for open-world scientific discoveries. However, current GCD methods are limited to unimodal data, overlooking the inherently multimodal nature of most real-world data. In this work, we extend GCD to a multimodal setting, where inputs from different modalities provide richer and complementary information. Through theoretical analysis and empirical validation, we identify that the key challenge in multimodal GCD lies in effectively aligning heterogeneous information across modalities. To address this, we propose MM-GCD, a novel framework that aligns both the feature and output spaces of different modalities using contrastive learning and distillation techniques. MM-GCD achieves new state-of-the-art performance on the UPMC-Food101 and N24News datasets, surpassing previous methods by 11.5% and 4.7%, respectively.
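The two alignment ideas named in the abstract, feature-space alignment through contrastive learning and output-space alignment through distillation, can be illustrated with a minimal sketch. This is not the MM-GCD implementation: the abstract does not specify the loss functions, so the symmetric InfoNCE loss and the symmetric KL divergence below are stand-ins chosen for illustration, and every function name and the `temperature` parameter are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_alignment_loss(img_feats, txt_feats, temperature=0.1):
    """Assumed feature-space alignment: symmetric InfoNCE.

    The i-th image feature and the i-th text feature form the positive
    pair; every other pairing in the batch acts as a negative.
    """
    img = [l2_normalize(v) for v in img_feats]
    txt = [l2_normalize(v) for v in txt_feats]
    n = len(img)
    # Cosine similarity between every image and every text, scaled by temperature.
    sim = [[sum(a * b for a, b in zip(img[i], txt[j])) / temperature
            for j in range(n)] for i in range(n)]
    loss_i2t = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    loss_t2i = -sum(log_softmax([sim[i][j] for i in range(n)])[j]
                    for j in range(n)) / n
    return 0.5 * (loss_i2t + loss_t2i)


def output_distillation_loss(img_logits, txt_logits):
    """Assumed output-space alignment: symmetric KL divergence between the
    class distributions predicted from each modality, averaged over samples."""
    total = 0.0
    for zi, zt in zip(img_logits, txt_logits):
        lp, lq = log_softmax(zi), log_softmax(zt)
        kl_pq = sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))
        kl_qp = sum(math.exp(b) * (b - a) for a, b in zip(lp, lq))
        total += 0.5 * (kl_pq + kl_qp)
    return total / len(img_logits)
```

Under these assumptions, a batch whose image and text features point in matching directions gives a small contrastive loss, and two modalities that predict identical class distributions give a distillation loss of zero. In training, the two terms would typically be added to the clustering objective with weighting coefficients, which the abstract does not state.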

Related Material

@InProceedings{Su_2025_CVPR,
    author    = {Su, Yuchang and Zhou, Renping and Huang, Siyu and Li, Xingjian and Wang, Tianyang and Wang, Ziyue and Xu, Min},
    title     = {Multimodal Generalized Category Discovery},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2025},
    pages     = {1634-1643}
}