ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models
Abstract
3D asset generation is receiving massive amounts of attention, inspired by the recent success of text-guided 2D content creation. Existing text-to-3D methods use pretrained text-to-image diffusion models in an optimization problem or fine-tune them on synthetic data, which often results in non-photorealistic 3D objects without backgrounds. In this paper, we present a method that leverages pretrained text-to-image models as a prior and learns to generate multi-view images in a single denoising process from real-world data. Concretely, we propose to integrate 3D volume-rendering and cross-frame-attention layers into each block of the existing U-Net network of the text-to-image model. Moreover, we design an autoregressive generation scheme that renders more 3D-consistent images at any viewpoint. We train our model on real-world datasets of objects and showcase its capability to generate instances with a variety of high-quality shapes and textures in authentic surroundings. Compared to existing methods, the results generated by our method are consistent and have favorable visual quality (-30% FID, -37% KID).
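The abstract's key architectural change is inserting cross-frame-attention (alongside volume-rendering) layers into each U-Net block so that all views of an object are denoised jointly. The following is a minimal PyTorch sketch of the cross-frame-attention idea only, not the authors' implementation: the class name CrossFrameAttention, the n_views argument, and all tensor shapes are illustrative assumptions about how per-view U-Net tokens could attend across frames.

    # Minimal sketch (illustrative, not the paper's code): attention computed
    # across all views of one object instead of within a single image.
    import torch
    import torch.nn as nn

    class CrossFrameAttention(nn.Module):
        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, x: torch.Tensor, n_views: int) -> torch.Tensor:
            # x: (batch * n_views, tokens, dim) -- per-view feature tokens from a U-Net block.
            bv, t, d = x.shape
            b = bv // n_views
            # Concatenate the tokens of all views of one object into a single sequence,
            # so each view can attend to every other view during denoising.
            x = x.reshape(b, n_views * t, d)
            out, _ = self.attn(x, x, x)
            # Restore the per-view layout expected by the rest of the U-Net block.
            return out.reshape(bv, t, d)

    # Usage with made-up shapes: 2 objects x 4 views, 64 tokens, 320 channels.
    feats = torch.randn(2 * 4, 64, 320)
    layer = CrossFrameAttention(dim=320)
    print(layer(feats, n_views=4).shape)  # torch.Size([8, 64, 320])

Such a layer would typically be placed after the existing self-attention of each U-Net block; the 3D volume-rendering layers described in the paper are a separate component and are not sketched here.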
Related Material
[pdf] [supp]
[bibtex]
@InProceedings{Hollein_2024_CVPR,
    author    = {H\"ollein, Lukas and Bo\v{z}i\v{c}, Alja\v{z} and M\"uller, Norman and Novotny, David and Tseng, Hung-Yu and Richardt, Christian and Zollh\"ofer, Michael and Nie{\ss}ner, Matthias},
    title     = {ViewDiff: 3D-Consistent Image Generation with Text-to-Image Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {5043-5052}
}