3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds

Lichen Zhao, Daigang Cai, Lu Sheng, Dong Xu; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 2928-2937


Visual grounding on 3D point clouds is an emerging vision and language task that benefits various applications in understanding the 3D visual world. By formulating this task as a grounding-by-detection problem, lots of recent works focus on how to exploit more powerful detectors and comprehensive language features, but (1) how to model complex relations for generating context-aware object proposals and (2) how to leverage proposal relations to distinguish the true target object from similar proposals are not fully studied yet. Inspired by the well-known transformer architecture, we propose a relation-aware visual grounding method on 3D point clouds, named as 3DVG-Transformer, to fully utilize the contextual clues for relationenhanced proposal generation and cross-modal proposal disambiguation, which are enabled by a newly designed coordinate-guided contextual aggregation (CCA) module in the object proposal generation stage, and a multiplex attention (MA) module in the cross-modal feature fusion stage. We validate that our 3DVG-Transformer outperforms the state-of-the-art methods by a large margin, on two point cloud-based visual grounding datasets, ScanRefer and Nr3D/Sr3D from ReferIt3D, especially for complex scenarios containing multiple objects of the same category.

Related Material

@InProceedings{Zhao_2021_ICCV, author = {Zhao, Lichen and Cai, Daigang and Sheng, Lu and Xu, Dong}, title = {3DVG-Transformer: Relation Modeling for Visual Grounding on Point Clouds}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, month = {October}, year = {2021}, pages = {2928-2937} }