Video Prediction by Modeling Videos as Continuous Multi-Dimensional Processes

Gaurav Shrivastava, Abhinav Shrivastava; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 7236-7245

Abstract


Diffusion models have made significant strides in image generation, mastering tasks such as unconditional image synthesis, text-to-image translation, and image-to-image conversions. However, their capability falls short in the realm of video prediction, mainly because they treat videos as a collection of independent images, relying on external constraints such as temporal attention mechanisms to enforce temporal coherence. In our paper, we introduce a novel model class that treats video as a continuous multi-dimensional process rather than a series of discrete frames. Through extensive experimentation, we establish state-of-the-art performance in video prediction, validated on benchmark datasets including KTH, BAIR, Human3.6M, and UCF101.

Related Material


@InProceedings{Shrivastava_2024_CVPR,
    author    = {Shrivastava, Gaurav and Shrivastava, Abhinav},
    title     = {Video Prediction by Modeling Videos as Continuous Multi-Dimensional Processes},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {7236-7245}
}