Scene-LLM: Extending Language Model for 3D Visual Reasoning

Rao Fu, Jingyu Liu, Xilun Chen, Yixin Nie, Wenhan Xiong; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 2195-2206

Abstract


This paper introduces Scene-LLM, a 3D-visual-language model that enhances embodied agents' abilities in interactive 3D indoor environments by integrating the reasoning strengths of Large Language Models (LLMs). Scene-LLM adopts a unified 3D visual feature representation that incorporates dense spatial information and supports scene state updates. The model employs a projection layer to efficiently project these features into the pre-trained textual embedding space, enabling effective interpretation of 3D visual information. Unique to our approach is the integration of both scene-level and egocentric 3D information in a compact hybrid representation. This combination is pivotal for interactive planning, where scene-level data supports global planning and egocentric data is important for localization. Notably, we use egocentric 3D frame features for feature alignment, an efficient technique that equips the model with fine-grained concepts. Our experiments with Scene-LLM demonstrate its strong capabilities in scene captioning, question answering, and interactive planning. We believe Scene-LLM advances the field of 3D visual understanding and reasoning, offering new possibilities for sophisticated agent interactions in indoor settings.
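The projection step described above can be sketched as follows. This is a minimal, hypothetical illustration (not the authors' code): a learned linear layer maps each 3D visual feature vector into the LLM's text-embedding space, so projected "scene tokens" can be interleaved with ordinary text-token embeddings. The dimensions and the random weights stand in for learned parameters and are assumptions for the sketch.

```python
import random

D_VIS, D_TEXT = 8, 4  # toy visual-feature / text-embedding sizes (assumed)

def linear_project(feature, weights):
    """Map one D_VIS-dim feature to D_TEXT dims via a linear layer.

    In a real model the weights would be learned; here they are random.
    """
    return [sum(w * x for w, x in zip(row, feature)) for row in weights]

random.seed(0)
# Random stand-in for the trained projection matrix (D_TEXT x D_VIS).
weights = [[random.uniform(-1, 1) for _ in range(D_VIS)] for _ in range(D_TEXT)]

# Five 3D visual features (e.g. per-point or per-voxel descriptors), each
# projected to one token in the text-embedding space.
scene_features = [[random.uniform(-1, 1) for _ in range(D_VIS)] for _ in range(5)]
scene_tokens = [linear_project(f, weights) for f in scene_features]
```

After projection, each scene token has the same dimensionality as a text-token embedding, which is what lets a pre-trained LLM consume visual and textual inputs in one sequence.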

Related Material


[bibtex]
@InProceedings{Fu_2025_WACV,
  author    = {Fu, Rao and Liu, Jingyu and Chen, Xilun and Nie, Yixin and Xiong, Wenhan},
  title     = {Scene-LLM: Extending Language Model for 3D Visual Reasoning},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
  month     = {February},
  year      = {2025},
  pages     = {2195-2206}
}