@InProceedings{Wang_2026_CVPR,
    author    = {Wang, Hongjun and Liu, Lin and Li, Jianguo and Lin, Tao},
    title     = {Dual-Granularity Memory for Efficient Video Generation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {38016-38026}
}
Dual-Granularity Memory for Efficient Video Generation
Abstract
Video generation using recurrent architectures offers compelling efficiency advantages over attention-based transformers, particularly for long-sequence generation. However, chunked processing in recurrent models creates temporal discontinuities that harm long-range consistency. We introduce two complementary memory mechanisms to address this challenge at different granularities: (1) Context Memory maintains persistent global context within attention chunks through learnable sink columns and boundary buffers, adding only 150K parameters (<0.1% overhead); (2) Latent Context-as-Memory (LCaM) extends memory across video segments by storing and retrieving historical latent embeddings, enabling cross-segment consistency without requiring camera annotations or frame reconstruction. Applied to Generalized Spatial-temporal Propagation Networks (GSTPN), our dual-memory approach achieves 1.54× faster inference than attention-based transformers, while excelling in visual quality metrics. Our approach is particularly effective for knowledge distillation scenarios where only pre-extracted latent embeddings are available. This work demonstrates compelling efficiency-quality trade-offs for practical long video generation.
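As a rough illustration of the "storing and retrieving historical latent embeddings" idea behind LCaM, the sketch below keeps a bounded store of per-segment latents and returns the stored segments closest to a query latent. The abstract does not specify how retrieval or eviction works, so everything here is an assumption made for illustration: the class name `LatentMemoryBank`, the FIFO eviction policy, cosine-similarity ranking, and the toy three-dimensional vectors are all hypothetical and are not the authors' implementation.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class LatentMemoryBank:
    """Hypothetical bounded store of per-segment latent embeddings."""

    def __init__(self, capacity=8):
        # Assumption: the oldest segment is evicted once capacity is reached.
        self.entries = deque(maxlen=capacity)

    def store(self, segment_id, latent):
        """Remember the latent embedding produced for one video segment."""
        self.entries.append((segment_id, list(latent)))

    def retrieve(self, query, k=2):
        """Return ids of the k stored segments most similar to the query latent."""
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[1]), reverse=True)
        return [segment_id for segment_id, _ in ranked[:k]]


# Toy usage: four segments pushed into a bank that holds three.
bank = LatentMemoryBank(capacity=3)
bank.store(0, [1.0, 0.0, 0.0])
bank.store(1, [0.0, 1.0, 0.0])
bank.store(2, [0.9, 0.1, 0.0])
bank.store(3, [0.0, 0.0, 1.0])  # capacity exceeded, so segment 0 is dropped
top = bank.retrieve([1.0, 0.0, 0.0], k=2)
```

Because the store holds only latent vectors, a scheme of this kind needs neither camera annotations nor decoded frames, which matches the setting described in the abstract where only pre-extracted latent embeddings are available.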