ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding

Le Xue, Ning Yu, Shu Zhang, Artemis Panagopoulou, Junnan Li, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, Silvio Savarese; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 27091-27101

Abstract


Recent advancements in multimodal pre-training have shown promising efficacy in 3D representation learning by aligning multimodal features across 3D shapes, their 2D counterparts, and language descriptions. However, the methods used by existing frameworks to curate such multimodal data, in particular language descriptions for 3D shapes, are not scalable, and the collected language descriptions are not diverse. To address this, we introduce ULIP-2, a simple yet effective tri-modal pre-training framework that leverages large multimodal models to automatically generate holistic language descriptions for 3D shapes. It requires only 3D data as input, eliminating the need for any manual 3D annotations, and is therefore scalable to large datasets. ULIP-2 is also equipped with scaled-up backbones for better multimodal representation learning. We conduct experiments on two large-scale 3D datasets, Objaverse and ShapeNet, and augment them with tri-modal datasets of 3D point clouds, images, and language for training ULIP-2. Experiments show that ULIP-2 delivers substantial benefits in three downstream tasks: zero-shot 3D classification, standard 3D classification with fine-tuning, and 3D captioning (3D-to-language generation). It achieves a new SOTA of 50.6% (top-1) on Objaverse-LVIS and 84.7% (top-1) on ModelNet40 in zero-shot classification. On the ScanObjectNN benchmark for standard fine-tuning, ULIP-2 reaches an overall accuracy of 91.5% with a compact model of only 1.4 million parameters. ULIP-2 sheds light on a new paradigm for scalable multimodal 3D representation learning without human annotations and shows significant improvements over existing baselines. The code and datasets are released at https://github.com/salesforce/ULIP.
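
To make the tri-modal alignment concrete, below is a minimal PyTorch sketch of a CLIP-style contrastive objective over point-cloud, image, and text embeddings. This is an illustration under stated assumptions, not the released implementation: the function names, the symmetric InfoNCE formulation, and the temperature value are ours, and the random tensors stand in for features from a trainable 3D encoder and frozen pretrained 2D image/text encoders.

import torch
import torch.nn.functional as F

def contrastive_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss between two batches of paired embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature  # (B, B) cosine-similarity matrix
    targets = torch.arange(a.size(0), device=a.device)  # matched pairs on the diagonal
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

def tri_modal_loss(pc_embed, img_embed, txt_embed):
    """Pull 3D features toward both the image and the language features (assumed formulation)."""
    return contrastive_loss(pc_embed, img_embed) + contrastive_loss(pc_embed, txt_embed)

# Toy usage: a batch of 8 shapes with 512-d embeddings per modality.
B, D = 8, 512
pc = torch.randn(B, D, requires_grad=True)  # output of the trainable point-cloud encoder
img = torch.randn(B, D)                     # frozen image features (e.g., from rendered views)
txt = torch.randn(B, D)                     # frozen text features (e.g., generated captions)
loss = tri_modal_loss(pc, img, txt)
loss.backward()  # gradients flow only into the point-cloud branch

In this sketch only the 3D branch receives gradients, mirroring the idea of aligning a 3D encoder to a fixed, pretrained vision-language embedding space; the actual training recipe may differ.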

Related Material


@InProceedings{Xue_2024_CVPR,
    author    = {Xue, Le and Yu, Ning and Zhang, Shu and Panagopoulou, Artemis and Li, Junnan and Mart{\'\i}n-Mart{\'\i}n, Roberto and Wu, Jiajun and Xiong, Caiming and Xu, Ran and Niebles, Juan Carlos and Savarese, Silvio},
    title     = {ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {27091-27101}
}