GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding

Rui Hu, Lianghui Zhu, Yuxuan Zhang, Tianheng Cheng, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 23105-23114

Abstract


Pixel grounding, encompassing tasks such as Referring Expression Segmentation (RES), has garnered considerable attention due to its potential for bridging the gap between vision and language modalities. However, advancements in this domain are currently constrained by limitations inherent in existing datasets, including limited object categories, insufficient textual diversity, and a scarcity of high-quality annotations. To mitigate these limitations, we introduce GroundingSuite, which comprises: (1) an automated data annotation framework leveraging multiple Vision-Language Model (VLM) agents; (2) a large-scale training dataset encompassing 9.56 million diverse referring expressions and their corresponding segmentations; and (3) a meticulously curated evaluation benchmark consisting of 3,800 images. The GroundingSuite dataset boosts model performance to state-of-the-art levels: a cIoU of 68.9 on gRefCOCO and a gIoU of 55.3 on RefCOCOm. Moreover, the GroundingSuite annotation framework is more efficient than the current leading data annotation method: it is 4.5x faster than GLaMM. Code is available at: https://github.com/hustvl/GroundingSuite.
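
The abstract reports results as cIoU and gIoU. As a rough illustration of how these two metrics are commonly defined in the referring segmentation literature, the sketch below computes cIoU as total intersection over total union across all samples, and gIoU as the mean of per-sample IoU. The function names, the nested-list mask representation, and the convention of scoring an empty prediction against an empty target as 1.0 are assumptions made for this example; the paper's own evaluation protocol may differ.

```python
from typing import Sequence, Tuple

# Assumption for this sketch: a binary mask is a sequence of rows of 0/1 values.
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, target: Mask) -> Tuple[int, int]:
    """Count pixels in the intersection and in the union of two binary masks."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p and t:
                inter += 1
            if p or t:
                union += 1
    return inter, union


def ciou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Cumulative IoU: summed intersections divided by summed unions."""
    total_inter = 0
    total_union = 0
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    # Assumed convention: if every mask is empty, count it as a perfect score.
    return total_inter / total_union if total_union else 1.0


def giou(preds: Sequence[Mask], targets: Sequence[Mask]) -> float:
    """Mean of per-sample IoU values."""
    if len(preds) == 0:
        raise ValueError("need at least one sample")
    scores = []
    for pred, target in zip(preds, targets):
        inter, union = intersection_and_union(pred, target)
        # Assumed convention: empty prediction vs. empty target scores 1.0.
        scores.append(inter / union if union else 1.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
    targets = [[[1, 0], [0, 0]], [[1, 0], [0, 0]]]
    print("cIoU:", ciou(preds, targets))
    print("gIoU:", giou(preds, targets))
```

Because cIoU pools pixels across the whole dataset, it tends to be dominated by large objects, whereas gIoU weights every sample equally; this is one reason the two are often reported together.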

Related Material


@InProceedings{Hu_2025_ICCV,
    author    = {Hu, Rui and Zhu, Lianghui and Zhang, Yuxuan and Cheng, Tianheng and Liu, Lei and Liu, Heng and Ran, Longjin and Chen, Xiaoxin and Liu, Wenyu and Wang, Xinggang},
    title     = {GroundingSuite: Measuring Complex Multi-Granular Pixel Grounding},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {23105-23114}
}