AIDE: Improving 3D Open-Vocabulary Semantic Segmentation by Aligned Vision-Language Learning
Abstract
3D open-vocabulary semantic segmentation aims to recognize countless categories beyond the limited set of annotations used in traditional settings. Due to the lack of large-scale 3D vision-language segmentation data, current solutions distill knowledge from pre-trained 2D vision-language models (VLMs) into 3D models instead of training models from scratch. However, this distillation is supervised by misaligned 3D-scene-image-to-text data pairs, which leads to suboptimal performance. Moreover, because 2D VLMs are trained on 2D datasets, their text encoders, which serve as the bridge between 3D models and an unbounded set of categories, lack 3D semantics. In this paper, to address these issues and improve generalization performance, we propose an AlIgned 3D Open-Vocabulary SEmantic Segmentation framework called AIDE, with two novel modules. To collect high-quality, well-aligned 3D-scene-image-to-text pairs, our CLIP-rewarded alignment module (i) generates diverse captions of multi-view images of 3D scenes by varying the sampling temperature to capture details, and then (ii) samples captions based on their similarity to the corresponding images, yielding rich and accurate associations. Next, to adapt 2D VLMs to 3D contexts, our adaptive segmentation module (iii) introduces trainable tokens in the input space and at each layer of the text encoder while keeping the text encoder frozen to avoid catastrophic forgetting. Extensive experiments show that AIDE outperforms previous methods by a large margin on three representative benchmarks, demonstrating its effectiveness.
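As a concrete illustration of steps (i)-(ii), the following is a minimal sketch of CLIP-rewarded caption selection, not the authors' code: it assumes a hypothetical captioner handle `generate_caption(image, temperature)` and uses the Hugging Face CLIP implementation to score image-caption similarity.

```python
# Sketch of the CLIP-rewarded alignment idea: generate diverse captions at
# several temperatures, then keep those with the highest CLIP similarity to
# the image. `generate_caption` is a hypothetical stand-in captioner.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_rewarded_captions(image, generate_caption,
                           temperatures=(0.5, 0.7, 1.0, 1.3), top_k=2):
    # (i) diverse captions via temperature sampling
    captions = [generate_caption(image, t) for t in temperatures]
    inputs = processor(text=captions, images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                          attention_mask=inputs["attention_mask"])
    # cosine similarity between each caption and the image
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    scores = (txt_emb @ img_emb.T).squeeze(-1)
    # (ii) keep the captions CLIP judges most faithful to the image
    keep = scores.topk(top_k).indices.tolist()
    return [captions[i] for i in keep]
```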
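Step (iii) can likewise be sketched as deep prompt tuning: trainable tokens prepended to the input and re-injected at every layer of a frozen text encoder. The stand-in encoder, dimensions, and prompt length below are assumptions for illustration, not AIDE's exact architecture.

```python
# Sketch of trainable per-layer prompt tokens over a frozen text encoder.
# The nn.TransformerEncoderLayer stack is a randomly initialized stand-in
# for a pretrained CLIP text encoder; all sizes are assumed values.
import torch
import torch.nn as nn

class DeepPromptedTextEncoder(nn.Module):
    def __init__(self, embed_dim=512, num_layers=12, num_prompts=8, num_heads=8):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(embed_dim, num_heads, batch_first=True)
            for _ in range(num_layers)
        ])
        # Freeze the encoder to avoid catastrophic forgetting.
        for p in self.layers.parameters():
            p.requires_grad = False
        # One set of trainable prompt tokens per layer.
        self.prompts = nn.Parameter(torch.randn(num_layers, num_prompts, embed_dim) * 0.02)
        self.num_prompts = num_prompts

    def forward(self, token_embeds):  # token_embeds: (batch, seq_len, dim)
        B = token_embeds.size(0)
        x = token_embeds
        for i, layer in enumerate(self.layers):
            prompts = self.prompts[i].unsqueeze(0).expand(B, -1, -1)
            x = torch.cat([prompts, x], dim=1)   # prepend layer-i prompts
            x = layer(x)
            x = x[:, self.num_prompts:, :]       # drop prompt slots before next layer
        return x

encoder = DeepPromptedTextEncoder()
trainable = [n for n, p in encoder.named_parameters() if p.requires_grad]
assert trainable == ["prompts"]  # only the prompt tokens receive gradients
out = encoder(torch.randn(2, 16, 512))
```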
Related Material
[pdf] [supp]
[bibtex]
@InProceedings{Wang_2025_WACV,
    author    = {Wang, Yimu and Czarnecki, Krzysztof},
    title     = {AIDE: Improving 3D Open-Vocabulary Semantic Segmentation by Aligned Vision-Language Learning},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {2674-2685}
}