GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields

Yunsong Wang, Hanlin Chen, Gim Hee Lee; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 20443-20453

Abstract


Recent advancements in vision-language foundation models have significantly enhanced open-vocabulary 3D scene understanding. However, the generalizability of existing methods is constrained by their framework designs and their reliance on 3D data. We address this limitation by introducing Generalizable Open-Vocabulary Neural Semantic Fields (GOV-NeSF), a novel approach offering a generalizable implicit representation of 3D scenes with open-vocabulary semantics. We aggregate geometry-aware features using a cost volume, and propose a Multi-view Joint Fusion module that aggregates multi-view features through a cross-view attention mechanism, which effectively predicts view-specific blending weights for both colors and open-vocabulary features. Remarkably, our GOV-NeSF exhibits state-of-the-art performance in both 2D and 3D open-vocabulary semantic segmentation, eliminating the need for ground-truth semantic labels or depth priors, and effectively generalizes across scenes and datasets without fine-tuning.
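To make the fusion step concrete, the sketch below shows one way a cross-view attention module could predict per-view blending weights that are shared between colors and open-vocabulary features, as the abstract describes. This is a minimal PyTorch sketch, not the authors' implementation: the module name MultiViewJointFusion, the feature dimensions, and the single-layer attention are assumptions for illustration only.

import torch
import torch.nn as nn

class MultiViewJointFusion(nn.Module):
    # Hypothetical sketch: attend across source views at each query point,
    # then softmax over views to obtain per-view blending weights.
    def __init__(self, feat_dim=64, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)
        self.to_weight = nn.Linear(feat_dim, 1)

    def forward(self, view_feats, view_colors, view_ov_feats):
        # view_feats:    (P, V, C) features of P query points in V source views
        # view_colors:   (P, V, 3) colors sampled from the source views
        # view_ov_feats: (P, V, D) open-vocabulary (e.g. CLIP-like) features
        fused, _ = self.attn(view_feats, view_feats, view_feats)  # cross-view attention
        w = torch.softmax(self.to_weight(fused), dim=1)           # (P, V, 1) blending weights
        color = (w * view_colors).sum(dim=1)                      # blended color, (P, 3)
        ov_feat = (w * view_ov_feats).sum(dim=1)                  # blended semantic feature, (P, D)
        return color, ov_feat

# Example usage with random inputs (shapes are assumptions):
points, views = 1024, 8
fusion = MultiViewJointFusion(feat_dim=64)
c, f = fusion(torch.randn(points, views, 64),
              torch.rand(points, views, 3),
              torch.randn(points, views, 512))

Sharing one set of weights between color and semantic blending keeps the radiance and semantic fields geometrically consistent, which is the design choice the abstract highlights.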

Related Material


BibTeX
@InProceedings{Wang_2024_CVPR,
    author    = {Wang, Yunsong and Chen, Hanlin and Lee, Gim Hee},
    title     = {GOV-NeSF: Generalizable Open-Vocabulary Neural Semantic Fields},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {20443-20453}
}