Mobile Video Diffusion
Haitam Ben Yahia, Denis Korzhenkov, Ioannis Lelekas, Amir Ghodrati, Amirhossein Habibian; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 19450-19460
Abstract
Video diffusion models have achieved impressive realism and controllability, but their high computational demands restrict their use on mobile devices. This paper introduces the first mobile-optimized image-to-video diffusion model. Starting from the spatio-temporal UNet of Stable Video Diffusion (SVD), we reduce its computational cost by lowering the frame resolution, incorporating multi-scale temporal representations, and introducing two novel pruning schemas that cut the number of channels and temporal blocks. Furthermore, we employ adversarial finetuning to reduce denoising to a single step. Our model, coined MobileVD, generates latents for a 14x512x256 px clip in 1.7 seconds on a Xiaomi 14 Pro with negligible quality loss. Our results are available at https://qualcomm-ai-research.github.io/mobile-video-diffusion
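To make the single-step setup concrete, below is a minimal PyTorch sketch of one-step image-to-video denoising with a distilled SVD-style UNet. Everything here is an illustrative assumption: TinySpatioTemporalUNet, the latent shapes, and the fixed noise level stand in for the released MobileVD implementation, which is not reproduced by this code.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for a pruned spatio-temporal UNet; the real
# MobileVD architecture is not public, so this is only a placeholder.
class TinySpatioTemporalUNet(nn.Module):
    def __init__(self, channels: int = 4):
        super().__init__()
        # A single 3D conv mixing space and time; MobileVD's actual
        # blocks (with channel and temporal-block pruning) differ.
        self.net = nn.Conv3d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, noisy_latents, image_latents, sigma):
        # Condition on the broadcast image latents by channel concat,
        # as SVD-style image-to-video models do. A real model would also
        # embed sigma as timestep conditioning; this stub ignores it.
        x = torch.cat([noisy_latents, image_latents], dim=1)
        return self.net(x)

@torch.no_grad()
def one_step_generate(unet, image_latent, num_frames=14, sigma=1.0):
    """Single-step denoising: one UNet call replaces the full sampler."""
    b, c, h, w = image_latent.shape
    noise = torch.randn(b, c, num_frames, h, w) * sigma
    # Repeat the conditioning image latent across all frames.
    cond = image_latent.unsqueeze(2).expand(-1, -1, num_frames, -1, -1)
    # The adversarially finetuned model maps noise directly to clean
    # video latents in one forward pass.
    return unet(noise, cond, sigma)

# 512x256 px frames correspond to 64x32 latents under an 8x VAE downsample.
unet = TinySpatioTemporalUNet()
img_latent = torch.randn(1, 4, 32, 64)
video_latents = one_step_generate(unet, img_latent)
print(video_latents.shape)  # torch.Size([1, 4, 14, 32, 64])
```

The point of the sketch is the cost model: with distillation to a single step, latency is dominated by one UNet forward pass (plus VAE decoding), rather than by tens of iterative denoising steps.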
Related Material

[pdf] [supp] [arXiv] [bibtex]

@InProceedings{Ben_Yahia_2025_ICCV,
    author    = {Ben Yahia, Haitam and Korzhenkov, Denis and Lelekas, Ioannis and Ghodrati, Amir and Habibian, Amirhossein},
    title     = {Mobile Video Diffusion},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {19450-19460}
}