STDD: Spatio-Temporal Dual Diffusion for Video Generation
Abstract
Diffusion probabilistic models have become a cornerstone of data generation, particularly for synthesizing high-quality images. Extending them to video, however, calls for a principled treatment of diffusion along the temporal sequence, whereas most existing video diffusion methods rely on spatial-domain diffusion alone. In this work, we propose an explicit Spatio-Temporal Dual Diffusion (STDD) method that extends the standard diffusion model, in a principled way, to a spatio-temporal diffusion model with joint spatial and temporal noise propagation/reduction. Mathematically, we derive an analysable dual diffusion process that accumulates noise/information along the temporal sequence as well as in the spatial domain. Correspondingly, we theoretically derive a spatio-temporal probabilistic reverse diffusion process and propose an accelerated sampling scheme to reduce inference cost. In principle, the spatio-temporal dual diffusion allows information from previous frames to be transferred to the current frame, which benefits video consistency. Extensive experiments demonstrate that STDD is competitive with state-of-the-art methods on video generation/prediction as well as text-to-video generation.
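The abstract only describes the dual forward process at a high level; the exact update rule appears in the paper, not here. As a rough illustration of the idea, the NumPy sketch below noises a video frame by frame in a standard DDPM style while also carrying a fraction of the previous frame's noisy state forward, so that noise/information accumulates along the temporal axis as well as the spatial one. The coupling weight lam and the overall form of the update are assumptions made for illustration, not the paper's formulation.

import numpy as np

def dual_diffusion_forward(frames, alpha_bar, lam=0.1, seed=0):
    """Illustrative forward pass: each frame is diffused spatially
    (DDPM-style noising) and additionally mixes in the previous
    frame's noisy state, so noise/information also propagates
    along the temporal axis. NOT the paper's exact update rule.

    frames:    array of shape (K, H, W), clean video frames
    alpha_bar: cumulative noise-schedule value in (0, 1) for step t
    lam:       hypothetical temporal coupling weight (assumption)
    """
    rng = np.random.default_rng(seed)
    noisy = np.empty_like(frames)
    prev = np.zeros_like(frames[0])  # frame 0 has no predecessor
    for k in range(frames.shape[0]):
        eps = rng.standard_normal(frames[k].shape)
        # spatial diffusion of frame k plus temporal carry-over from k-1
        noisy[k] = (np.sqrt(alpha_bar) * frames[k]
                    + lam * prev
                    + np.sqrt(1.0 - alpha_bar) * eps)
        prev = noisy[k]
    return noisy

# Example: 8 frames of 32x32, moderately noised
video = np.random.default_rng(1).standard_normal((8, 32, 32))
noised = dual_diffusion_forward(video, alpha_bar=0.5)
print(noised.shape)  # (8, 32, 32)

Because each noisy frame feeds into the next, earlier frames influence later ones even in this toy version, which is the intuition behind why the dual process can help video consistency.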
Related Material

BibTeX:
@InProceedings{Yao_2025_CVPR,
    author    = {Yao, Shuaizhen and Zhang, Xiaoya and Liu, Xin and Liu, Mengyi and Cui, Zhen},
    title     = {STDD: Spatio-Temporal Dual Diffusion for Video Generation},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {12575-12584}
}