DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification

Sitian Shen, Zilin Zhu, Linqian Fan, Harry Zhang, Xinxiao Wu; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 3596-3605

Abstract


Large pre-trained models have revolutionized computer vision by enabling multi-modal learning. Notably, CLIP has exhibited remarkable proficiency in tasks such as image classification, object detection, and semantic segmentation. Nevertheless, its efficacy on 3D point clouds is limited by the domain gap between the depth maps obtained by projecting the point clouds and the natural images CLIP was trained on. This paper introduces DiffCLIP, a novel pre-training framework that integrates Stable Diffusion with ControlNet to bridge this domain gap in the visual branch. To address few-shot tasks in the textual branch, we further incorporate a style-prompt generation module. Extensive experiments on the ModelNet10, ModelNet40, and ScanObjectNN datasets show that DiffCLIP has strong 3D understanding ability. Using Stable Diffusion and style-prompt generation, DiffCLIP achieves state-of-the-art zero-shot classification accuracy of 43.2% on the OBJ_BG split of ScanObjectNN and 82.4% on ModelNet10.
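
To make the pipeline concrete, the sketch below (not the authors' released code) illustrates the visual-branch idea described in the abstract with off-the-shelf components: a point cloud is projected to a depth map, a depth-conditioned ControlNet guides Stable Diffusion to render a realistic image from that depth map, and CLIP then scores the image against text prompts for zero-shot classification. The checkpoints, prompt wording, projection routine, and helper names (point_cloud_to_depth_map, zero_shot_classify) are illustrative assumptions, not details taken from the paper.

import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from transformers import CLIPModel, CLIPProcessor


def point_cloud_to_depth_map(points: np.ndarray, size: int = 512) -> Image.Image:
    """Orthographically project an (N, 3) point cloud onto the XY plane,
    keeping the largest-Z point per pixel as a simple depth map (assumed projection)."""
    xy = points[:, :2]
    xy = (xy - xy.min(0)) / (xy.max(0) - xy.min(0) + 1e-8)   # normalize to [0, 1]
    px = np.clip((xy * (size - 1)).astype(int), 0, size - 1)
    z = (points[:, 2] - points[:, 2].min()) / (points[:, 2].max() - points[:, 2].min() + 1e-8)
    depth = np.zeros((size, size), dtype=np.float32)
    np.maximum.at(depth, (px[:, 1], px[:, 0]), z)             # z-buffer per pixel
    return Image.fromarray((depth * 255).astype(np.uint8)).convert("RGB")


# Depth-conditioned ControlNet steering Stable Diffusion (checkpoints are assumptions).
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-depth")
sd_pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet
)

# CLIP scores the generated image against class-name prompts (zero-shot).
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")


def zero_shot_classify(points: np.ndarray, class_names: list[str]) -> str:
    depth = point_cloud_to_depth_map(points)
    # Translate the depth map into a photorealistic image to shrink the domain
    # gap between projected depth maps and the natural images CLIP was trained on.
    image = sd_pipe("a photo of an object", image=depth,
                    num_inference_steps=20).images[0]
    prompts = [f"a photo of a {c}" for c in class_names]
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = clip(**inputs).logits_per_image   # (1, num_classes) similarity scores
    return class_names[int(logits.argmax())]

Note that this sketch covers only the visual-branch idea; the paper's style-prompt generation module for the textual branch is not reflected here.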

Related Material


@InProceedings{Shen_2024_WACV,
    author    = {Shen, Sitian and Zhu, Zilin and Fan, Linqian and Zhang, Harry and Wu, Xinxiao},
    title     = {DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2024},
    pages     = {3596-3605}
}