VipDiff: Towards Coherent and Diverse Video Inpainting via Training-Free Denoising Diffusion Models

Chaohao Xie, Kai Han, Kwan-Yee K. Wong; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 2411-2420

Abstract


Recent video inpainting methods have achieved encouraging improvements by leveraging optical flow to guide pixel propagation from reference frames, either in the image space or the feature space. However, they produce severe artifacts when the masked area is too large and no pixel correspondences can be found. Recently, denoising diffusion models have demonstrated impressive performance in generating diverse and high-quality images, and they have been exploited in a number of works for image inpainting. These methods, however, cannot be applied directly to videos to produce temporally coherent inpainting results. In this paper, we propose a training-free framework, named VipDiff, that conditions the diffusion model during the reverse diffusion process to produce temporally coherent inpainting results, without requiring any training data or fine-tuning of the pre-trained models. VipDiff takes optical flow as guidance to extract valid pixels from reference frames, which serve as constraints in optimizing the randomly sampled Gaussian noise, and uses the generated results for further pixel propagation and conditional generation. VipDiff also allows diverse video inpainting results to be generated from different noise samples. Experiments demonstrate that our VipDiff outperforms state-of-the-art methods in terms of both spatio-temporal coherence and fidelity.
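The abstract describes the algorithm only in words; the PyTorch snippet below is a rough, hypothetical illustration of a training-free noise-optimization loop of the kind described: the initial Gaussian noise is optimized so that the frame produced by a frozen, pre-trained image diffusion model agrees with pixels propagated from reference frames via optical flow. It assumes a DDIM-style deterministic sampler, and all names (inpaint_frame, ddim_step, denoiser, warped_pixels, valid_mask) are illustrative, not taken from the paper.

import torch

def ddim_step(x_t, x0_pred, alpha_t, alpha_prev):
    # Deterministic DDIM update from timestep t to the previous timestep,
    # given the model's prediction of the clean image x0.
    eps = (x_t - alpha_t.sqrt() * x0_pred) / (1.0 - alpha_t).sqrt()
    return alpha_prev.sqrt() * x0_pred + (1.0 - alpha_prev).sqrt() * eps

def inpaint_frame(denoiser, shape, valid_mask, warped_pixels,
                  alphas_cumprod, timesteps, n_opt_steps=20, lr=0.05):
    # Hypothetical sketch: the initial Gaussian noise is the only variable
    # being optimized; the pre-trained denoiser (noisy frame, t) -> x0 stays frozen.
    z = torch.randn(shape, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)

    def reverse_pass(noise):
        x = noise
        for i, t in enumerate(timesteps):          # timesteps in decreasing order
            a_t = alphas_cumprod[t]
            a_prev = alphas_cumprod[timesteps[i + 1]] if i + 1 < len(timesteps) \
                     else torch.tensor(1.0)
            x0_pred = denoiser(x, t)
            x = ddim_step(x, x0_pred, a_t, a_prev)
        return x

    for _ in range(n_opt_steps):
        frame = reverse_pass(z)
        # Flow-propagated pixels act as constraints on the generated frame,
        # applied only where valid correspondences exist.
        loss = ((frame - warped_pixels) * valid_mask).abs().mean()
        opt.zero_grad()
        loss.backward()        # backprop through the reverse chain is memory-heavy;
        opt.step()             # a short timestep schedule keeps this tractable

    with torch.no_grad():
        return reverse_pass(z)  # final reverse pass with the optimized noise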

Related Material


[bibtex]
@InProceedings{Xie_2025_WACV,
    author    = {Xie, Chaohao and Han, Kai and Wong, Kwan-Yee K.},
    title     = {VipDiff: Towards Coherent and Diverse Video Inpainting via Training-Free Denoising Diffusion Models},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {2411-2420}
}