One-Minute Video Generation with Test-Time Training

Karan Dalal, Daniel Koceja, Jiarui Xu, Yue Zhao, Shihao Han, Ka Chun Cheung, Jan Kautz, Yejin Choi, Yu Sun, Xiaolong Wang; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), 2025, pp. 17702-17711

Abstract


Transformers today still struggle to generate one-minute videos because self-attention layers are inefficient for long context. Alternatives such as Mamba layers struggle to produce coherent scenes because their hidden states are small and less expressive. We experiment with Test-Time Training (TTT) layers, whose hidden states themselves can be neural networks, therefore larger and more expressive. Adding TTT layers into a pre-trained Transformer enables it to generate one-minute videos from text storyboards. We curate a dataset based on Tom and Jerry cartoons as a proof-of-concept benchmark. Compared to baselines such as Mamba 2, Gated DeltaNet, and sliding-window attention layers, TTT layers generate much more coherent videos that tell complete stories, leading by 34 Elo points in a human evaluation of 100 videos per method. Although promising, our results are still limited in physical realism, and the efficiency of our implementation can be further improved. Sample videos, code and annotations are available at: https://test-time-training.github.io/video-dit
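For intuition, below is a minimal sketch of the idea behind a TTT layer as described in the abstract: the layer's hidden state is itself a small neural network (here a two-layer MLP) whose weights are updated by a gradient step on a self-supervised reconstruction loss as each token arrives, and the layer's output is that updated network applied to a query projection. This is an illustrative PyTorch sketch under those assumptions, not the authors' implementation; the names (TTTLayer, inner_lr, the key/value/query projections) and the per-token update loop are chosen only for clarity.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TTTLayer(nn.Module):
    """Hidden state = a small MLP (W1, W2) trained by gradient descent on a
    self-supervised loss while the sequence is being processed (sketch only)."""
    def __init__(self, dim: int, hidden_dim: int = 64, inner_lr: float = 0.1):
        super().__init__()
        # Projections defining the self-supervised (key -> value) task and the query.
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.to_q = nn.Linear(dim, dim, bias=False)
        # Initial weights of the inner model, i.e. the layer's hidden state.
        self.W1 = nn.Parameter(torch.randn(dim, hidden_dim) * dim ** -0.5)
        self.W2 = nn.Parameter(torch.randn(hidden_dim, dim) * hidden_dim ** -0.5)
        self.inner_lr = inner_lr

    def inner_model(self, x, W1, W2):
        # The "hidden state" applied to an input: a 2-layer MLP.
        return F.gelu(x @ W1) @ W2

    def forward(self, x):  # x: (batch, seq_len, dim)
        B, T, _ = x.shape
        k, v, q = self.to_k(x), self.to_v(x), self.to_q(x)
        # Per-sequence copies of the inner weights; they play the role of the
        # recurrent hidden state and evolve as tokens are consumed.
        W1 = self.W1.detach().unsqueeze(0).expand(B, -1, -1).clone().requires_grad_(True)
        W2 = self.W2.detach().unsqueeze(0).expand(B, -1, -1).clone().requires_grad_(True)
        outputs = []
        for t in range(T):
            kt, vt, qt = k[:, t:t + 1], v[:, t:t + 1], q[:, t:t + 1]
            with torch.enable_grad():
                # One gradient step on reconstructing v from k updates the hidden state.
                loss = F.mse_loss(self.inner_model(kt, W1, W2), vt)
                g1, g2 = torch.autograd.grad(loss, (W1, W2))
            W1 = (W1 - self.inner_lr * g1).detach().requires_grad_(True)
            W2 = (W2 - self.inner_lr * g2).detach().requires_grad_(True)
            # The layer's output is the freshly updated inner model applied to the query.
            outputs.append(self.inner_model(qt, W1, W2))
        return torch.cat(outputs, dim=1)

# Example: y = TTTLayer(128)(torch.randn(2, 16, 128)) gives a (2, 16, 128) output.

Two caveats on this sketch: a per-token update loop like the one above would be slow in practice, and efficient implementations batch the inner updates over chunks of tokens; also, the sketch performs test-time updates only, whereas training the layer end to end would require keeping the inner gradient steps differentiable (create_graph=True, no detach).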

Related Material


@InProceedings{Dalal_2025_CVPR,
    author    = {Dalal, Karan and Koceja, Daniel and Xu, Jiarui and Zhao, Yue and Han, Shihao and Cheung, Ka Chun and Kautz, Jan and Choi, Yejin and Sun, Yu and Wang, Xiaolong},
    title     = {One-Minute Video Generation with Test-Time Training},
    booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR)},
    month     = {June},
    year      = {2025},
    pages     = {17702-17711}
}