OV3D-CG: Open-vocabulary 3D Instance Segmentation with Contextual Guidance

Mingquan Zhou, Chen He, Ruiping Wang, Xilin Chen; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 5305-5314

Abstract


Open-vocabulary 3D instance segmentation (OV-3DIS), which aims to segment and classify objects beyond predefined categories, is a critical capability for embodied AI applications. Existing methods rely on pre-trained 2D foundation models, focusing on instance-level features while overlooking contextual relationships, limiting their ability to generalize to rare or ambiguous objects. To address these limitations, we propose an OV-3DIS framework guided by contextual information. First, we employ a Class-agnostic Proposal Module, integrating a pre-trained 3D segmentation model with a SAM-guided segmenter to extract robust 3D instance masks. Subsequently, we design a Semantic Reasoning Module, which selects the best viewpoint for each instance and constructs three 2D context-aware representations. The representations are processed using Multimodal Large Language Models with Chain-of-Thought prompting to enhance semantic inference. Notably, our method outperforms state-of-the-art methods on the ScanNet200 and Replica datasets, demonstrating superior open-vocabulary segmentation capabilities. Moreover, preliminary implementation in real-world scenarios verifies our method's robustness and accuracy, highlighting its potential for embodied AI tasks such as object-driven navigation. Our project page is at: https://vipl-vsu.github.io/OV3D-CG/.

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Zhou_2025_ICCV, author = {Zhou, Mingquan and He, Chen and Wang, Ruiping and Chen, Xilin}, title = {OV3D-CG: Open-vocabulary 3D Instance Segmentation with Contextual Guidance}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, month = {October}, year = {2025}, pages = {5305-5314} }