LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning

Sijin Chen, Xin Chen, Chi Zhang, Mingsheng Li, Gang Yu, Hao Fei, Hongyuan Zhu, Jiayuan Fan, Tao Chen; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 26428-26438

Abstract


Recent progress in Large Multimodal Models (LMMs) has opened up great possibilities for various applications in the field of human-machine interaction. However, developing LMMs that can comprehend, reason, and plan in complex and diverse 3D environments remains a challenging topic, especially considering the demand for understanding permutation-invariant point cloud representations of the 3D scene. Existing works seek help from multi-view images by projecting 2D features into 3D space, which inevitably leads to huge computational overhead and performance degradation. In this paper, we present LL3DA, a Large Language 3D Assistant that takes point clouds as direct input and responds to both text instructions and visual interactions. The additional visual interactions enable LMMs to better comprehend human interactions with the 3D environment and further remove the ambiguities within plain text. Experiments show that LL3DA achieves remarkable results and surpasses various 3D vision-language models on both 3D Dense Captioning and 3D Question Answering.
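
To make the interface described above concrete, here is a minimal sketch (not the authors' code) of what "point cloud as direct input, plus a text instruction and an optional visual interaction such as a user click or a 3D box" could look like as a query structure. All class, function, and field names below (VisualPrompt, SceneQuery, respond) are hypothetical illustrations under that assumption.

# Minimal sketch of the input interface described in the abstract; all names are hypothetical.
from dataclasses import dataclass
from typing import Optional
import numpy as np


@dataclass
class VisualPrompt:
    """A user interaction with the 3D scene, e.g. a clicked point or a 3D box."""
    click_xyz: Optional[np.ndarray] = None    # shape (3,): a clicked 3D location
    box_corners: Optional[np.ndarray] = None  # shape (8, 3): corners of a 3D bounding box


@dataclass
class SceneQuery:
    point_cloud: np.ndarray                   # shape (N, 6): xyz + rgb, no multi-view images needed
    instruction: str                          # e.g. "Describe the chair next to the window."
    visual_prompt: Optional[VisualPrompt] = None


def respond(query: SceneQuery) -> str:
    """Placeholder only: a real assistant would encode the point cloud, fuse it with
    the instruction and visual prompt, and decode a text response."""
    prompt_note = " (with visual prompt)" if query.visual_prompt else ""
    return f"[stub] {query.point_cloud.shape[0]} points, instruction: {query.instruction!r}{prompt_note}"


if __name__ == "__main__":
    pts = np.random.rand(50_000, 6).astype(np.float32)
    q = SceneQuery(pts, "What is on the table near the door?",
                   VisualPrompt(click_xyz=np.array([1.2, 0.4, 0.8])))
    print(respond(q))

The point is only to show the shape of the inputs: the scene enters as a raw point cloud rather than projected 2D features, and the visual prompt disambiguates instructions that plain text alone cannot.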

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Chen_2024_CVPR,
    author    = {Chen, Sijin and Chen, Xin and Zhang, Chi and Li, Mingsheng and Yu, Gang and Fei, Hao and Zhu, Hongyuan and Fan, Jiayuan and Chen, Tao},
    title     = {LL3DA: Visual Interactive Instruction Tuning for Omni-3D Understanding, Reasoning, and Planning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {26428-26438}
}