Generating Long-Take Videos via Effective Keyframes and Guidance
Hsin-Ping Huang, Yu-Chuan Su, Ming-Hsuan Yang
Abstract
We tackle the challenge of generating long-take videos that encompass multiple non-repetitive yet coherent events. Existing approaches generate long videos conditioned on a single input guidance source, which often leads to repetitive content. To address this problem, we develop a framework that uses multiple guidance sources to enhance long video generation. The main idea of our approach is to decouple video generation into keyframe generation and frame interpolation. In this process, keyframe generation focuses on creating multiple coherent events, while the frame interpolation stage generates smooth intermediate frames between keyframes using existing video generation models. A novel mask attention module is further introduced to improve coherence and efficiency. Experiments on challenging real-world videos demonstrate that the proposed method outperforms prior methods by up to 9.5% in objective metrics.
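The following is a minimal sketch of the decoupled two-stage pipeline described in the abstract, assuming a text-conditioned keyframe generator and a keyframe-conditioned frame interpolator are supplied as callables. All names and signatures are illustrative placeholders, not the authors' implementation.

```python
# Sketch of the keyframe-then-interpolation pipeline described in the abstract.
# The keyframe_model and interpolation_model callables are hypothetical stand-ins
# for whatever generation models are used; they are not the paper's actual code.
from typing import Callable, List, Sequence, TypeVar

Frame = TypeVar("Frame")  # e.g. an image tensor of shape (H, W, 3)


def generate_long_take(
    event_prompts: Sequence[str],
    keyframe_model: Callable[[str, List[Frame]], Frame],
    interpolation_model: Callable[[Frame, Frame, int], List[Frame]],
    frames_per_segment: int = 16,
) -> List[Frame]:
    """Generate a long-take video from multiple guidance prompts."""
    # Stage 1: keyframe generation. Each guidance prompt yields one keyframe,
    # conditioned on the keyframes generated so far to keep the events coherent.
    keyframes: List[Frame] = []
    for prompt in event_prompts:
        keyframes.append(keyframe_model(prompt, keyframes))
    if not keyframes:
        return []

    # Stage 2: frame interpolation. An existing video generation model fills in
    # smooth intermediate frames between each pair of consecutive keyframes.
    video: List[Frame] = [keyframes[0]]
    for start, end in zip(keyframes, keyframes[1:]):
        video.extend(interpolation_model(start, end, frames_per_segment))
        video.append(end)
    return video
```

The mask attention module mentioned in the abstract would presumably operate inside the keyframe generation stage (attending over previously generated keyframes); it is not modeled in this sketch.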
Related Material
[pdf]
[bibtex] @InProceedings{Huang_2025_WACV, author = {Huang, Hsin-Ping and Su, Yu-Chuan and Yang, Ming-Hsuan}, title = {Generating Long-Take Videos via Effective Keyframes and Guidance}, booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)}, month = {February}, year = {2025}, pages = {3709-3720} }