Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data
Abstract
Recent years have witnessed remarkable progress in multi-view diffusion models for 3D content creation. However, there remains a significant gap in image quality and prompt-following ability compared to 2D diffusion models. A critical bottleneck is the scarcity of high-quality 3D data with detailed captions. To address this challenge, we propose Bootstrap3D, a novel framework that automatically generates filtered multi-view images to assist in training multi-view diffusion models. Specifically, we introduce a data generation pipeline that employs (1) 2D and video diffusion models to generate multi-view images based on constructed text prompts, and (2) our fine-tuned 3D-aware MV-LLaVA for filtering data and rewriting inaccurate captions. Leveraging this pipeline, we have generated a large-scale set of synthetic multi-view images with dense descriptive captions. Furthermore, we present a Training Timestep Reschedule (TTR) strategy that leverages the denoising process to learn multi-view consistency while maintaining the original 2D diffusion prior. Extensive experiments demonstrate that Bootstrap3D can generate high-quality multi-view images with superior aesthetic quality, image-text alignment, and view consistency.
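As a rough illustration of the data generation pipeline described in the abstract, the sketch below strings the two stages together: a 2D or video diffusion model proposes multi-view images for each constructed prompt, and an MV-LLaVA-style model filters weak samples and rewrites their captions. The helper callables (generate_views, quality_score, rewrite_caption) and the threshold are hypothetical placeholders, not interfaces from the paper or its released code.

from typing import Any, Callable, List, Tuple

def bootstrap_dataset(
    prompts: List[str],
    generate_views: Callable[[str], List[Any]],       # 2D / video diffusion model -> candidate multi-view images
    quality_score: Callable[[List[Any], str], float],  # MV-LLaVA-style quality / consistency score
    rewrite_caption: Callable[[List[Any], str], str],  # MV-LLaVA-style dense re-captioning
    threshold: float = 0.5,                            # placeholder acceptance threshold
) -> List[Tuple[List[Any], str]]:
    """Generate, filter, and re-caption synthetic multi-view training samples."""
    dataset: List[Tuple[List[Any], str]] = []
    for prompt in prompts:
        views = generate_views(prompt)
        if quality_score(views, prompt) < threshold:
            continue  # discard low-quality or view-inconsistent samples
        dataset.append((views, rewrite_caption(views, prompt)))
    return dataset

Keeping the generator, filter, and re-captioner behind plain callables keeps the sketch independent of any particular diffusion or vision-language model implementation.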
Related Material

[pdf] [supp] [arXiv] [bibtex]

@InProceedings{Sun_2025_ICCV,
    author    = {Sun, Zeyi and Wu, Tong and Zhang, Pan and Zang, Yuhang and Dong, Xiaoyi and Xiong, Yuanjun and Lin, Dahua and Wang, Jiaqi},
    title     = {Bootstrap3D: Improving Multi-view Diffusion Model with Synthetic Data},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {15714-15726}
}