Ges3ViG : Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding

Atharv Mahesh Mane, Dulanga Weerakoon, Vigneshwaran Subbaraju, Sougata Sen, Sanjay E. Sarma, Archan Misra; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 9017-9026

Abstract


3-Dimensional Embodied Reference Understanding (3D-ERU) combines a language description and an accompanying pointing gesture to identify the most relevant target object in a 3D scene. Although prior work has explored purely language-based 3D grounding, 3D-ERU, which also incorporates human pointing gestures, remains largely unexplored. To address this gap, we introduce a data augmentation framework, Imputer, and use it to curate a new benchmark dataset, ImputeRefer, for 3D-ERU by incorporating human pointing gestures into existing 3D scene datasets that contain only language instructions. We also propose Ges3ViG, a novel model for 3D-ERU that achieves a 30% improvement in accuracy over other 3D-ERU models and a 9% improvement over purely language-based 3D grounding models. Our code and dataset are available at https://github.com/AtharvMane/Ges3ViG.

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Mane_2025_CVPR,
    author    = {Mane, Atharv Mahesh and Weerakoon, Dulanga and Subbaraju, Vigneshwaran and Sen, Sougata and Sarma, Sanjay E. and Misra, Archan},
    title     = {Ges3ViG : Incorporating Pointing Gestures into Language-Based 3D Visual Grounding for Embodied Reference Understanding},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {9017-9026}
}