Towards CLIP-driven Language-free 3D Visual Grounding via 2D-3D Relational Enhancement and Consistency

Yuqi Zhang, Han Luo, Yinjie Lei; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 13063-13072

Abstract


3D visual grounding plays a crucial role in scene understanding, with extensive applications in AR/VR. Despite the significant progress made by recent methods, the requirement of dense textual descriptions for each individual object, which is time-consuming and costly, hinders their scalability. To mitigate the reliance on text annotations during training, researchers have explored language-free training paradigms in the 2D field via explicit text generation or implicit feature substitution. Nevertheless, unlike 2D images, the complexity of spatial relations in 3D, coupled with the absence of robust 3D visual-language pre-trained models, makes it challenging to directly transfer these strategies. To tackle the above issues, in this paper we introduce a language-free training framework for 3D visual grounding. Using the visual-language joint embedding of a 2D large cross-modality model as a bridge, we can expediently produce pseudo-language features from the features of 2D images, which are equivalent to those of real textual descriptions. We further develop a relation injection scheme, with a Neighboring Relation-aware Modeling module and a Cross-modality Relation Consistency module, aiming to enhance and preserve the complex relationships between the 2D and 3D embedding spaces. Extensive experiments demonstrate that our proposed language-free 3D visual grounding approach obtains promising performance across three widely used datasets: ScanRefer, Nr3D, and Sr3D. Our code is available at https://github.com/xibi777/3DLFVG
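The core pseudo-language idea, using CLIP's shared image-text embedding space so that a 2D image feature can stand in for a text feature, can be illustrated with a minimal sketch. All shapes and features below are hypothetical stand-ins (random vectors in place of real CLIP outputs and a learned 3D encoder), not the paper's implementation:

```python
import numpy as np

def normalize(x, axis=-1):
    """L2-normalize vectors, as CLIP does before computing similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d = 512  # joint-embedding dimension (e.g., CLIP ViT-B/32)

# Stand-in for a CLIP image embedding of a 2D view of the target object.
# Because CLIP aligns images and text in one space, this embedding is
# used as a pseudo-language feature during language-free training.
image_feat = normalize(rng.standard_normal(d))
pseudo_text_feat = image_feat

# Stand-in features for N candidate 3D object proposals, assumed to be
# mapped into the same joint space by a (hypothetical) 3D encoder.
n = 5
obj_feats = normalize(rng.standard_normal((n, d)))

# Grounding reduces to cosine similarity against the pseudo-language
# feature; the highest-scoring proposal is the predicted object.
scores = obj_feats @ pseudo_text_feat
pred = int(np.argmax(scores))
print(pred, scores.shape)
```

At inference time a real text embedding from the same CLIP text encoder would replace `pseudo_text_feat`, which is exactly what the shared joint space makes possible.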

Related Material


@InProceedings{Zhang_2024_CVPR,
  author    = {Zhang, Yuqi and Luo, Han and Lei, Yinjie},
  title     = {Towards CLIP-driven Language-free 3D Visual Grounding via 2D-3D Relational Enhancement and Consistency},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
  pages     = {13063-13072}
}