Unified Dense Prediction of Video Diffusion

Lehan Yang, Lu Qi, Xiangtai Li, Sheng Li, Varun Jampani, Ming-Hsuan Yang; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 28963-28973

Abstract


We present a unified network for simultaneously generating videos and their corresponding entity segmentation and depth maps from text prompts. We use colormaps to represent entity masks and depth maps, tightly integrating dense prediction with RGB video generation. Introducing dense prediction information improves the consistency and motion smoothness of generated videos without increasing computational cost. Incorporating learnable task embeddings brings multiple dense prediction tasks into a single model, enhancing flexibility and further boosting performance. We further propose Panda-Dense, a large-scale dense prediction video dataset, addressing the lack of existing datasets that jointly provide captions, videos, segmentation, and depth maps. Comprehensive experiments demonstrate the high efficiency of our method, which surpasses the state of the art in video quality, consistency, and motion smoothness.
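To make the two mechanisms named above concrete, the following is a minimal sketch, not the authors' code: it illustrates (1) encoding a depth map as an RGB colormap image so it can flow through the same pipeline as RGB frames, and (2) a learnable task embedding added to the diffusion timestep embedding so one model can switch between dense prediction tasks. The names depth_to_colormap and TaskConditioner, the task set, and the choice of the viridis colormap are illustrative assumptions, not details from the paper.

import torch
import torch.nn as nn
import numpy as np
import matplotlib


def depth_to_colormap(depth: np.ndarray, cmap_name: str = "viridis") -> np.ndarray:
    """Map a (H, W) depth array to a (H, W, 3) RGB image in [0, 1].

    Normalizing per frame and applying a fixed colormap lets a depth map be
    treated like an ordinary RGB frame by a video diffusion pipeline.
    (Hypothetical helper; the paper's exact colormap scheme may differ.)
    """
    d = depth.astype(np.float32)
    d = (d - d.min()) / (d.max() - d.min() + 1e-8)  # normalize to [0, 1]
    cmap = matplotlib.colormaps[cmap_name]
    return cmap(d)[..., :3]  # drop the alpha channel


class TaskConditioner(nn.Module):
    """Learnable task embeddings added to the timestep embedding, so a single
    model can be steered toward RGB, segmentation, or depth outputs.
    (Illustrative assumption of how task embeddings could be injected.)"""

    TASKS = {"rgb": 0, "segmentation": 1, "depth": 2}

    def __init__(self, embed_dim: int):
        super().__init__()
        self.task_embed = nn.Embedding(len(self.TASKS), embed_dim)

    def forward(self, t_embed: torch.Tensor, task: str) -> torch.Tensor:
        idx = torch.full(
            (t_embed.shape[0],), self.TASKS[task],
            dtype=torch.long, device=t_embed.device,
        )
        return t_embed + self.task_embed(idx)


# Usage: color-code a random depth map and condition a batch of timestep embeddings.
depth_rgb = depth_to_colormap(np.random.rand(64, 64))
cond = TaskConditioner(embed_dim=256)
t_embed = torch.randn(4, 256)  # e.g. sinusoidal timestep features
print(depth_rgb.shape, cond(t_embed, "depth").shape)  # (64, 64, 3) torch.Size([4, 256])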

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Yang_2025_CVPR,
    author    = {Yang, Lehan and Qi, Lu and Li, Xiangtai and Li, Sheng and Jampani, Varun and Yang, Ming-Hsuan},
    title     = {Unified Dense Prediction of Video Diffusion},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {28963-28973}
}