Data-Efficient 3D Visual Grounding via Order-Aware Referring

Tung-Yu Wu, Sheng-Yu Huang, Yu-Chiang Frank Wang; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 3107-3117

Abstract


3D visual grounding aims to identify the target object within a 3D point cloud scene referred to by a natural language description. Previous works usually require significant amounts of paired point cloud scenes and descriptions to exploit the corresponding complicated verbo-visual relations. In our work, we introduce Vigor, a novel Data-Efficient 3D Visual Grounding framework via Order-aware Referring. Vigor leverages a large language model (LLM) to produce a desirable referential order from the input description for 3D visual grounding. With the proposed stacked object-referring blocks, the predicted anchor objects in the above order allow one to locate the target object progressively, without supervision on the identities of the anchor objects or the exact relations between anchor and target objects. We also present an order-aware warm-up training strategy, which augments referential orders for pre-training the visual grounding framework, allowing us to better capture the complex verbo-visual relations and benefiting the desired data-efficient learning scheme. Experimental results on the Nr3D and ScanRefer datasets demonstrate our superiority in low-resource scenarios. In particular, Vigor surpasses current state-of-the-art frameworks by 9.3% and 7.6% in grounding accuracy under the 1% and 10% data settings on the Nr3D dataset, respectively.
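
To make the pipeline concrete, below is a minimal, hypothetical sketch (not the authors' implementation) of the order-aware referring idea: assuming an LLM has already extracted a referential order of anchor phrases from the description, and that per-object features have been computed from the point cloud, stacked object-referring blocks let each step of the order attend over candidate objects, progressively narrowing down the target. All class and variable names (ObjectReferringBlock, OrderAwareGrounder, step_texts, obj_feats) are illustrative assumptions, not names from the paper.

# Hypothetical sketch of order-aware referring, under the assumptions above.
# step_texts: (B, S, D), one feature per phrase in the LLM's referential
#             order (anchors first, target last); obj_feats: (B, N, D),
#             features of the N candidate objects in the scene.
import torch
import torch.nn as nn

class ObjectReferringBlock(nn.Module):
    """One referring step: the current phrase state attends over objects."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query, obj_feats):
        # query: (B, 1, D) running state for the current anchor/target phrase
        out, _ = self.attn(query, obj_feats, obj_feats)
        return self.norm(query + out)

class OrderAwareGrounder(nn.Module):
    """Stacks one referring block per step of the referential order."""
    def __init__(self, dim: int, steps: int):
        super().__init__()
        self.blocks = nn.ModuleList(ObjectReferringBlock(dim) for _ in range(steps))

    def forward(self, step_texts, obj_feats):
        # Fold each ordered phrase into the running query, one block per step.
        query = torch.zeros_like(step_texts[:, :1, :])
        for i, block in enumerate(self.blocks):
            query = block(query + step_texts[:, i:i+1, :], obj_feats)
        # Score every candidate object against the final referring state;
        # the argmax is the predicted target object.
        logits = (obj_feats @ query.transpose(1, 2)).squeeze(-1)  # (B, N)
        return logits

# Toy usage: 2 scenes, a 3-step referential order, 8 candidate objects, 256-d.
grounder = OrderAwareGrounder(dim=256, steps=3)
logits = grounder(torch.randn(2, 3, 256), torch.randn(2, 8, 256))
print(logits.shape)  # torch.Size([2, 8])

Note that in this sketch the anchor objects are never supervised directly: only the final target logits would receive a loss, which matches the paper's claim that no anchor identities or explicit anchor/target relations are needed.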

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Wu_2025_WACV,
    author    = {Wu, Tung-Yu and Huang, Sheng-Yu and Wang, Yu-Chiang Frank},
    title     = {Data-Efficient 3D Visual Grounding via Order-Aware Referring},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {3107-3117}
}