HyperSDFusion: Bridging Hierarchical Structures in Language and Geometry for Enhanced 3D Text2Shape Generation

Leng, Zhiying; Birdal, Tolga; Liang, Xiaohui; Tombari, Federico

Zhiying Leng, Tolga Birdal, Xiaohui Liang, Federico Tombari; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 19691-19700

Abstract

3D shape generation from text is a fundamental task in 3D representation learning. Text-shape pairs exhibit a hierarchical structure: a general text such as "chair" covers all 3D chair shapes, while more detailed prompts refer to more specific shapes. Furthermore, both text and 3D shapes are inherently hierarchical structures. However, existing Text2Shape methods, such as SDFusion, do not exploit this. In this work, we propose HyperSDFusion, a dual-branch diffusion model that generates 3D shapes from a given text. Since hyperbolic space is suitable for handling hierarchical data, we propose to learn the hierarchical representations of text and 3D shapes in hyperbolic space. First, we introduce a hyperbolic text-image encoder to learn the sequential and multi-modal hierarchical features of text in hyperbolic space. In addition, we design a hyperbolic text-graph convolution module to learn the hierarchical features of text in hyperbolic space. To fully utilize these text features, we introduce a dual-branch structure to embed text features in 3D feature space. Finally, to endow the generated 3D shapes with a hierarchical structure, we devise a hyperbolic hierarchical loss. Our method is the first to explore the hyperbolic hierarchical representation for text-to-shape generation. Experiments on the existing text-to-shape paired dataset, Text2Shape, show that our method achieves state-of-the-art results. We release our implementation under HyperSDFusion.github.io.
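To illustrate why hyperbolic space suits hierarchical data, the following is a minimal sketch of the geodesic distance in the Poincaré ball, one common model of hyperbolic space. The abstract does not say which model the authors use, so the choice of the Poincaré ball, the unit curvature, and the function name `poincare_distance` are assumptions made for illustration only; this is not the authors' implementation.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball.

    Hypothetical helper for illustration; assumes curvature -1.
    """
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v))
    return math.acosh(arg)


# A point near the origin can play the role of a general concept,
# while points near the boundary can play the role of specific instances.
root = (0.0, 0.0)
leaf_a = (0.9, 0.0)
leaf_b = (0.0, 0.9)

d_root_leaf = poincare_distance(root, leaf_a)
d_leaf_leaf = poincare_distance(leaf_a, leaf_b)
print(d_root_leaf, d_leaf_leaf)
```

In this sketch the two boundary points are farther from each other than either is from the origin, and by a larger factor than in Euclidean geometry. This rapid growth of distance toward the boundary is what lets tree-like structures embed with low distortion.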

Related Material


[bibtex]

@InProceedings{Leng_2024_CVPR,
    author    = {Leng, Zhiying and Birdal, Tolga and Liang, Xiaohui and Tombari, Federico},
    title     = {HyperSDFusion: Bridging Hierarchical Structures in Language and Geometry for Enhanced 3D Text2Shape Generation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {19691-19700}
}