Mobile Video Diffusion

Haitam Ben Yahia, Denis Korzhenkov, Ioannis Lelekas, Amir Ghodrati, Amirhossein Habibian; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 19450-19460

Abstract


Video diffusion models have achieved impressive realism and controllability, but their high computational demands restrict their use on mobile devices. This paper introduces the first mobile-optimized image-to-video diffusion model. Starting from the spatio-temporal UNet of Stable Video Diffusion (SVD), we reduce the computational cost by lowering the frame resolution, incorporating multi-scale temporal representations, and introducing two novel pruning schemes that reduce the number of channels and temporal blocks. Furthermore, we employ adversarial finetuning to reduce denoising to a single step. Our model, coined MobileVD, generates the latents for a 14x512x256 px clip in 1.7 seconds on a Xiaomi 14 Pro, with negligible quality loss. Our results are available at https://qualcomm-ai-research.github.io/mobile-video-diffusion
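To make the one-step generation scheme concrete, below is a minimal PyTorch sketch of single-step image-to-video latent generation. It assumes SVD-like latent shapes (8x spatial downsampling, 4 latent channels) and uses hypothetical names (TinyDenoiser, generate_single_step); it is an illustrative stand-in, not the paper's pruned MobileVD UNet.

import torch

# Assumed shapes: 14 frames of 512x256 px, with 8x spatial downsampling
# in the VAE and 4 latent channels (typical for SVD-style models; these
# numbers are assumptions, not taken from the paper).
FRAMES, LATENT_C, LATENT_H, LATENT_W = 14, 4, 256 // 8, 512 // 8

class TinyDenoiser(torch.nn.Module):
    """Toy stand-in for a pruned spatio-temporal UNet (illustrative only)."""
    def __init__(self, channels: int = 32):
        super().__init__()
        # 3D convs act on (batch, channels, frames, height, width),
        # mixing information across frames as a temporal block would.
        self.net = torch.nn.Sequential(
            torch.nn.Conv3d(LATENT_C, channels, 3, padding=1),
            torch.nn.SiLU(),
            torch.nn.Conv3d(channels, LATENT_C, 3, padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

@torch.no_grad()
def generate_single_step(model: torch.nn.Module, cond_latent: torch.Tensor) -> torch.Tensor:
    """One-step generation: a single forward pass maps noise (plus the
    conditioning image latent, broadcast over frames) to clean video
    latents, instead of iterating over tens of denoising steps."""
    noise = torch.randn(1, LATENT_C, FRAMES, LATENT_H, LATENT_W)
    cond = cond_latent.unsqueeze(2).expand_as(noise)  # repeat image latent over frames
    return model(noise + cond)

model = TinyDenoiser()
image_latent = torch.randn(1, LATENT_C, LATENT_H, LATENT_W)  # encoded input image
video_latents = generate_single_step(model, image_latent)
print(video_latents.shape)  # torch.Size([1, 4, 14, 32, 64])

The relevant point is the control flow: after adversarial finetuning, a single forward pass replaces the usual iterative sampling loop, which, together with the reduced resolution and the pruned UNet, is what the reported 1.7-second on-device latency rests on.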

Related Material

@InProceedings{Ben_Yahia_2025_ICCV,
    author    = {Ben Yahia, Haitam and Korzhenkov, Denis and Lelekas, Ioannis and Ghodrati, Amir and Habibian, Amirhossein},
    title     = {Mobile Video Diffusion},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {19450-19460}
}