ScanEnts3D: Exploiting Phrase-to-3D-Object Correspondences for Improved Visio-Linguistic Models in 3D Scenes

Ahmed Abdelreheem, Kyle Olszewski, Hsin-Ying Lee, Peter Wonka, Panos Achlioptas; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 3524-3534

Abstract


The two popular datasets ScanRefer [20] and ReferIt3D [5] connect natural language to real-world 3D scenes. In this paper, we curate a complementary dataset that extends both of them: we associate every object mentioned in a referential sentence with its underlying instance inside a 3D scene, whereas previous work did this only for a single object per sentence. Our Scan Entities in 3D (ScanEnts3D) dataset provides explicit correspondences between 369k objects across 84k referential sentences, covering 705 real-world scenes. We propose novel architecture modifications and losses that enable learning from this new type of data and improve performance on both neural listening and language generation. For neural listening, we improve the SoTA on the Nr3D and ScanRefer benchmarks by 4.3% and 5.0%, respectively. For language generation, we improve the SoTA by 13.2 CIDEr points on the Nr3D benchmark. For both tasks, the new type of data is used only to improve training; no additional annotations are required at inference time. Our introduced dataset is available on the project's webpage at https://scanents3d.github.io/.
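To make the annotation type concrete, below is a minimal Python sketch of how one ScanEnts3D-style record could be represented: a referential sentence whose target object and all additionally mentioned (anchor) objects are grounded to 3D instance ids. All class and field names here are illustrative assumptions, not the dataset's actual schema or loading API.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class PhraseCorrespondence:
    # Token span of an object phrase in the sentence (end-exclusive), and
    # the id of the 3D instance it refers to inside the scanned scene.
    span: Tuple[int, int]
    instance_id: int

@dataclass
class ScanEnts3DExample:
    scene_id: str                                # e.g. a ScanNet scene identifier
    sentence: str                                # the referential utterance
    target_instance_id: int                      # the single referred object, as in Nr3D/ScanRefer
    correspondences: List[PhraseCorrespondence]  # grounding for ALL mentioned objects

# Toy record: the target ("chair") plus two anchor objects are grounded.
example = ScanEnts3DExample(
    scene_id="scene0000_00",
    sentence="the chair next to the desk under the window",
    target_instance_id=7,
    correspondences=[
        PhraseCorrespondence(span=(1, 2), instance_id=7),   # "chair"  -> target
        PhraseCorrespondence(span=(5, 6), instance_id=12),  # "desk"   -> anchor
        PhraseCorrespondence(span=(8, 9), instance_id=3),   # "window" -> anchor
    ],
)

Such per-phrase grounding supplies extra supervision at training time (e.g., for auxiliary grounding losses), while inference still needs only the sentence and the scene, consistent with the paper's claim that no additional annotations are required at test time.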

Related Material


BibTeX:
@InProceedings{Abdelreheem_2024_WACV,
    author    = {Abdelreheem, Ahmed and Olszewski, Kyle and Lee, Hsin-Ying and Wonka, Peter and Achlioptas, Panos},
    title     = {ScanEnts3D: Exploiting Phrase-to-3D-Object Correspondences for Improved Visio-Linguistic Models in 3D Scenes},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2024},
    pages     = {3524-3534}
}