EgoSG: Learning 3D Scene Graphs from Egocentric RGB-D Sequences

Chaoyi Zhang, Xitong Yang, Ji Hou, Kris Kitani, Weidong Cai, Fu-Jen Chu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 2535-2545

Abstract


Constructing a 3D scene graph of an environment is essential for agents and smart glasses assistants to develop an understanding of their surroundings and predict relationships between various entities within it. 3D Scene Graph Prediction (3DSGP) is commonly adopted to predict the spatial and semantic relationships between objects, such as object containment or adjacency, in a 3D environment reconstructed from posed (calibrated) RGB-D sequences. However, reconstructing a scene can be time-consuming and computationally intensive, and requires specialized hardware like IMUs for accurate poses. This reliance on (1) robust algorithms and (2) accurate camera poses limits its applicability. Unlike existing 3DSGP methods, we propose EgoSG, which performs perception and reasoning on each frame without assuming available camera poses and estimates 3D scene graphs directly from egocentric frame sequences. In our method, per-frame instance features are acquired from a partial (per-frame) point cloud. By globally optimizing per-frame features, object instances are then associated across the egocentric frames, and graph representations are aggregated for 3D scene graph prediction. Compared to the state-of-the-art methods that heavily rely on 3D reconstruction, our approach is reconstruction-free and can be derived directly from unposed RGB-D sequences. We benchmark our EgoSG framework against existing reconstruction-based approaches on 3DSGP tasks. Our method outperforms the state-of-the-art methods by a large margin, achieving +44.63 R@1 in Object and +22.74 R@1 in Predicate from egocentric sequences, without any reliance on reconstruction algorithms or camera poses.
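
For readers unfamiliar with the R@1 figures quoted above, the following is a minimal sketch of how recall at k is commonly computed for classification-style predictions. It is not taken from the paper: the function name `recall_at_k`, the toy scores, and the labels are all hypothetical, and the paper's exact evaluation protocol may differ (for instance in how multi-label predicates are handled).

```python
from typing import Sequence


def recall_at_k(scores: Sequence[Sequence[float]],
                labels: Sequence[int],
                k: int = 1) -> float:
    """Fraction of samples whose ground-truth class is among the k
    highest-scoring classes (hypothetical helper, common definition)."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        return 0.0
    hits = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:k]
        hits += label in top_k
    return hits / len(labels)


# Hypothetical class scores for three object instances over three classes.
object_scores = [[0.1, 0.7, 0.2], [0.6, 0.3, 0.1], [0.2, 0.3, 0.5]]
object_labels = [1, 0, 1]
r_at_1 = recall_at_k(object_scores, object_labels, k=1)
```

In this toy example the first two instances are ranked correctly at the top and the third is not, so `r_at_1` is two thirds. The same function could be applied separately to object and predicate predictions, which is how the two R@1 numbers in the abstract are distinguished.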

Related Material


[bibtex]
@InProceedings{Zhang_2024_CVPR,
    author    = {Zhang, Chaoyi and Yang, Xitong and Hou, Ji and Kitani, Kris and Cai, Weidong and Chu, Fu-Jen},
    title     = {EgoSG: Learning 3D Scene Graphs from Egocentric RGB-D Sequences},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {2535-2545}
}