World-consistent Video Diffusion with Explicit 3D Modeling

Zhang, Qihang; Zhai, Shuangfei; Martin, Miguel Ángel Bautista; Miao, Kevin; Toshev, Alexander; Susskind, Joshua; Gu, Jiatao

Qihang Zhang, Shuangfei Zhai, Miguel Ángel Bautista Martin, Kevin Miao, Alexander Toshev, Joshua Susskind, Jiatao Gu; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 21685-21695

Abstract

Recent advancements in diffusion models have set new benchmarks in image and video generation, enabling realistic visual synthesis across single- and multi-frame contexts. However, these models still struggle with efficiently and explicitly generating 3D-consistent content. To address this, we propose World-consistent Video Diffusion (WVD), a novel framework that incorporates explicit 3D supervision using XYZ images, which encode global 3D coordinates for each image pixel. More specifically, we train a diffusion transformer to learn the joint distribution of RGB and XYZ frames. This approach supports multi-task adaptability via a flexible inpainting strategy. For example, WVD can estimate XYZ frames from ground-truth RGB or generate novel RGB frames using XYZ projections along a specified camera trajectory. In doing so, WVD unifies tasks like single-image-to-3D generation, multi-view stereo, and camera-controlled video generation. Our approach demonstrates competitive performance across multiple benchmarks, providing a scalable solution for 3D-consistent video and image generation with a single pretrained model.
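To make the XYZ-image representation concrete, the following is a minimal sketch of how a per-pixel map of global 3D coordinates could be built from a depth map and a camera pose. The abstract does not say how XYZ images are produced, so everything here is an assumption for illustration: the pinhole camera model, the z-depth convention, the integer pixel-coordinate convention, and the hypothetical function name `depth_to_xyz_image`.

```python
import numpy as np


def depth_to_xyz_image(depth, K, cam_to_world):
    """Unproject a depth map into an (H, W, 3) image of world coordinates.

    Assumptions (not taken from the paper):
      depth        : (H, W) z-depth along the optical axis
      K            : (3, 3) pinhole intrinsics
      cam_to_world : (4, 4) rigid camera-to-world transform
    """
    h, w = depth.shape
    # Pixel grid: u indexes columns (x), v indexes rows (y).
    u, v = np.meshgrid(np.arange(w, dtype=np.float64),
                       np.arange(h, dtype=np.float64))
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)

    # Back-project to camera-frame rays with z = 1, then scale by depth.
    rays = pixels @ np.linalg.inv(K).T
    points_cam = rays * depth[..., None]

    # Move the points into the shared world frame.
    rotation = cam_to_world[:3, :3]
    translation = cam_to_world[:3, 3]
    return points_cam @ rotation.T + translation


if __name__ == "__main__":
    K = np.array([[100.0, 0.0, 2.0],
                  [0.0, 100.0, 1.0],
                  [0.0, 0.0, 1.0]])
    depth = np.full((3, 4), 2.0)
    xyz = depth_to_xyz_image(depth, K, np.eye(4))
    print(xyz.shape)
```

Because every frame's XYZ image is expressed in one world frame, pixels from different views that observe the same surface point carry the same three values, which is what makes the representation usable as an explicit 3D consistency signal alongside RGB.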

Related Material

[bibtex]

@InProceedings{Zhang_2025_CVPR,
    author    = {Zhang, Qihang and Zhai, Shuangfei and Martin, Miguel \'Angel Bautista and Miao, Kevin and Toshev, Alexander and Susskind, Joshua and Gu, Jiatao},
    title     = {World-consistent Video Diffusion with Explicit 3D Modeling},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {21685-21695}
}