ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models

Lukas Höllein, Aljaž Božić, Norman Müller, David Novotny, Hung-Yu Tseng, Christian Richardt, Michael Zollhöfer, Matthias Nießner; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 5043-5052

Abstract


3D asset generation is receiving massive amounts of attention, inspired by the recent success of text-guided 2D content creation. Existing text-to-3D methods use pretrained text-to-image diffusion models in an optimization problem or fine-tune them on synthetic data, which often results in non-photorealistic 3D objects without backgrounds. In this paper, we present a method that leverages pretrained text-to-image models as a prior and learns to generate multi-view images in a single denoising process from real-world data. Concretely, we propose to integrate 3D volume-rendering and cross-frame-attention layers into each block of the existing U-Net network of the text-to-image model. Moreover, we design an autoregressive generation scheme that renders more 3D-consistent images at any viewpoint. We train our model on real-world datasets of objects and showcase its capability to generate instances with a variety of high-quality shapes and textures in authentic surroundings. Compared to existing methods, the results generated by our method are consistent and have favorable visual quality (-30% FID, -37% KID).
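
To make the cross-frame-attention idea above concrete, the following is a minimal PyTorch sketch of an attention layer whose keys and values span all views jointly, so tokens from one view can attend to tokens from every other view. The class name, tensor layout, and single-layer design are illustrative assumptions for this sketch, not the authors' implementation.

import torch
import torch.nn as nn

class CrossFrameAttention(nn.Module):
    # Hypothetical layer name; illustrates attention computed across all views
    # jointly, in the spirit of the layers inserted into each U-Net block.
    def __init__(self, channels: int, num_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, views, tokens, channels) -- assumed multi-view feature layout
        b, v, t, c = x.shape
        tokens = x.reshape(b, v * t, c)   # flatten views into one token sequence
        h = self.norm(tokens)
        out, _ = self.attn(h, h, h)       # queries, keys, values range over all views
        return (tokens + out).reshape(b, v, t, c)  # residual; restore view axis

# Example: 2 samples, 4 views, 64 spatial tokens, 320 channels (arbitrary sizes)
x = torch.randn(2, 4, 64, 320)
y = CrossFrameAttention(320)(x)  # same shape as x, now with cross-view mixing

Because queries, keys, and values are computed over the concatenated token sequence of all views, gradients couple the views during training, which is what encourages 3D-consistent outputs in a single denoising process.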

Related Material


[bibtex]
@InProceedings{Hollein_2024_CVPR,
    author    = {H\"ollein, Lukas and Bo\v{z}i\'c, Alja\v{z} and M\"uller, Norman and Novotny, David and Tseng, Hung-Yu and Richardt, Christian and Zollh\"ofer, Michael and Nie{\ss}ner, Matthias},
    title     = {ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {5043-5052}
}