-
[pdf]
[supp]
[bibtex]@InProceedings{Skorokhodov_2024_CVPR, author = {Skorokhodov, Ivan and Menapace, Willi and Siarohin, Aliaksandr and Tulyakov, Sergey}, title = {Hierarchical Patch Diffusion Models for High-Resolution Video Generation}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {7569-7579} }
Hierarchical Patch Diffusion Models for High-Resolution Video Generation
Abstract
Diffusion models have demonstrated remarkable performance in image and video synthesis. However scaling them to high-resolution inputs is challenging and requires restructuring the diffusion pipeline into multiple independent components limiting scalability and complicating downstream applications. In this work we study patch diffusion models (PDMs) -- a diffusion paradigm which models the distribution of patches rather than whole inputs keeping up to 0.7% of the original pixels. This makes it very efficient during training and unlocks end-to-end optimization on high-resolution videos. We improve PDMs in two principled ways. First to enforce consistency between patches we develop deep context fusion -- an architectural technique that propagates the context information from low-scale to high-scale patches in a hierarchical manner. Second to accelerate training and inference we propose adaptive computation which allocates more network capacity and computation towards coarse image details. The resulting model sets a new state-of-the-art FVD score of 66.32 and Inception Score of 87.68 in class-conditional video generation on UCF-101 256x256 surpassing recent methods by more than 100%. Then we show that it can be rapidly fine-tuned from a base 36x64 low-resolution generator for high-resolution 64x288x512 text-to-video synthesis. To the best of our knowledge our model is the first diffusion-based architecture which is trained on such high resolutions entirely end-to-end. Project webpage: https://snap-research.github.io/hpdm.
Related Material