FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing

Youyuan Zhang, Xuan Ju, James J. Clark; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 3657-3666

Abstract


Diffusion models have demonstrated remarkable capabilities in text-to-image and text-to-video generation, opening up possibilities for video editing based on textual input. However, the computational cost associated with sequential sampling in diffusion models poses challenges for efficient video editing. Existing approaches relying on image generation models for video editing suffer from time-consuming one-shot fine-tuning, additional condition extraction, or DDIM inversion, making real-time applications impractical. In this work, we propose FastVideoEdit, an efficient zero-shot video editing approach inspired by Consistency Models (CMs). By leveraging the self-consistency property of CMs, we eliminate the need for time-consuming inversion or additional condition extraction, reducing editing time. Our method enables direct mapping from source video to target video with strong preservation ability through attention control. This results in improved speed advantages, as fewer sampling steps can be used while maintaining comparable generation quality. Experimental results validate the state-of-the-art performance and speed advantages of FastVideoEdit across evaluation metrics encompassing editing speed, temporal consistency, and text-video alignment. The source code is available at github.com/youyuan-zhang/FastVideoEdit.
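To make the claimed speedup concrete, below is a minimal sketch of multistep consistency sampling in PyTorch. It assumes a learned consistency function f_theta(x, sigma) that maps any noisy latent on an ODE trajectory directly back to its clean endpoint (the self-consistency property the abstract refers to); the stub here simply returns its input so the sketch runs end to end, and all names (f_theta, multistep_consistency_sample, the sigma schedule) are illustrative rather than the paper's actual API. Because each call lands on a clean estimate in one jump, a handful of denoise/re-noise steps can replace a long DDIM chain, and no inversion pass over the source video is needed.

import torch

def f_theta(x: torch.Tensor, sigma: float) -> torch.Tensor:
    # Stand-in for a pretrained consistency model f_theta(x, sigma) -> x_0.
    # Returns x unchanged so the sketch is runnable without weights.
    return x

def multistep_consistency_sample(x_T: torch.Tensor, sigmas: list) -> torch.Tensor:
    # Multistep consistency sampling (Song et al., 2023): alternate between a
    # one-step jump to a clean estimate and re-noising to the next, lower
    # noise level. len(sigmas) network calls replace a long diffusion chain.
    x = f_theta(x_T, sigmas[0])
    for sigma in sigmas[1:]:
        x = f_theta(x + sigma * torch.randn_like(x), sigma)
    return x

# Four sampling steps instead of the ~50 typical for DDIM-based editors.
latents = torch.randn(1, 4, 8, 64, 64)  # (batch, channels, frames, H, W)
out = multistep_consistency_sample(latents, sigmas=[80.0, 24.0, 5.0, 0.5])
print(out.shape)  # torch.Size([1, 4, 8, 64, 64])

The abstract also attributes the method's preservation ability to attention control. A common realization of that idea, shown here in the spirit of Prompt-to-Prompt-style editing rather than as a verbatim account of FastVideoEdit's mechanism, records the attention maps from the source-prompt pass and replays them during the target-prompt pass so that unedited content keeps the source layout; AttentionController is a hypothetical name.

import torch

class AttentionController:
    # Records attention maps during the source pass and replays them during
    # the target pass; a minimal, single-resolution sketch of attention
    # injection.
    def __init__(self):
        self.stored = []
        self.inject = False
        self.cursor = 0

    def __call__(self, q, k, v):
        scale = q.shape[-1] ** -0.5
        attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)
        if self.inject:
            attn = self.stored[self.cursor]  # replay source attention maps
            self.cursor += 1
        else:
            self.stored.append(attn)         # record source attention maps
        return attn @ v

ctrl = AttentionController()
q = torch.randn(2, 8, 64, 40)   # queries from video latent tokens
k = torch.randn(2, 8, 77, 40)   # keys from text embeddings
v = torch.randn(2, 8, 77, 40)
_ = ctrl(q, k, v)               # source-prompt pass: record
ctrl.inject, ctrl.cursor = True, 0
edited = ctrl(q, k, v)          # target-prompt pass: inject
print(edited.shape)             # torch.Size([2, 8, 64, 40])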

Related Material


@InProceedings{Zhang_2025_WACV,
    author    = {Zhang, Youyuan and Ju, Xuan and Clark, James J.},
    title     = {FastVideoEdit: Leveraging Consistency Models for Efficient Text-to-Video Editing},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {3657-3666}
}