Corgi: Cached Memory Guided Video Generation

Xindi Wu, Uriel Singer, Zhaojiang Lin, Andrea Madotto, Xide Xia, Yifan Xu, Paul Crook, Xin Luna Dong, Seungwhan Moon; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 4585-4594

Abstract


Text-to-video (T2V) generation has achieved remarkable progress with the rise of diffusion models. In this work, we introduce Cached Memory-Guided Video Generation (Corgi), which aims to generate multi-scene videos with an arbitrary number of video clips conditioned on input images and instruction prompts. This is a challenging task, as traditional T2V methods often struggle to maintain the quality of longer videos due to the difficulty of preserving visual context from earlier scenes. We address this by introducing a cached memory mechanism that stores key frames. Our multi-scene video generation process is explicitly conditioned on the cached memory to avoid forgetting the visual appearance of target subjects. Corgi shows significant improvement in multi-scene video generation over the prior art, with gains of up to 59.2% in long-term consistency and 7.6% in diversity.
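
To make the mechanism concrete, below is a minimal Python sketch of how a cached key-frame memory could condition multi-scene generation. All names (CachedMemory, encode_frame, generate_clip, select_key_frame, the cache size, and the eviction policy) are hypothetical placeholders for illustration, not the paper's actual interfaces or implementation.

# Illustrative sketch of a cached key-frame memory for multi-scene
# generation. All identifiers here are assumptions, not the paper's API.

from dataclasses import dataclass, field

@dataclass
class CachedMemory:
    """Stores embeddings of key frames from previously generated scenes."""
    entries: list = field(default_factory=list)
    max_size: int = 8  # assumed cache capacity

    def add(self, key_frame_embedding):
        self.entries.append(key_frame_embedding)
        if len(self.entries) > self.max_size:
            self.entries.pop(0)  # drop the oldest entry (one possible eviction policy)

def generate_multi_scene_video(prompts, first_image,
                               encode_frame, generate_clip, select_key_frame):
    """Generate one clip per prompt, conditioning each on the cached memory.

    `encode_frame`, `generate_clip`, and `select_key_frame` stand in for the
    image encoder, conditional video generator, and key-frame selector.
    """
    memory = CachedMemory()
    memory.add(encode_frame(first_image))  # seed the memory with the input image
    clips = []
    for prompt in prompts:
        # Each scene is explicitly conditioned on the cached key-frame
        # features, so subject appearance is not forgotten across scenes.
        clip = generate_clip(prompt, memory_context=memory.entries)
        memory.add(encode_frame(select_key_frame(clip)))
        clips.append(clip)
    return clips

The key design point the sketch captures is that conditioning flows from the cache rather than only from the immediately preceding clip, which is what lets appearance information survive across an arbitrary number of scenes.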

Related Material


@InProceedings{Wu_2025_WACV,
    author    = {Wu, Xindi and Singer, Uriel and Lin, Zhaojiang and Madotto, Andrea and Xia, Xide and Xu, Yifan and Crook, Paul and Dong, Xin Luna and Moon, Seungwhan},
    title     = {Corgi: Cached Memory Guided Video Generation},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {4585-4594}
}