@InProceedings{Elmoghany_2025_ICCV,
    author    = {Elmoghany, Mohamed and Rossi, Ryan and Yoon, Seunghyun and Mukherjee, Subhojyoti and Bakr, Eslam Mohamed and Mathur, Puneet and Wu, Gang and Lai, Viet Dac and Lipka, Nedim and Zhang, Ruiyi and Manjunatha, Varun and Van Nguyen, Chien and Dangi, Daksh and Salinas, Abel and Chen, Hongjie and Huang, Xiaolei and Barrow, Joe and Ahmed, Nesreen and Eldardiry, Hoda and Park, Namyong and Wang, Yu and Tu, Zhengzhong and Nguyen, Thien Huu and Manocha, Dinesh and Elhoseiny, Mohamed and Dernoncourt, Franck},
    title     = {A Survey on Long-Video Storytelling Generation: Architectures, Consistency, and Cinematic Quality},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {7082-7094}
}
A Survey on Long-Video Storytelling Generation: Architectures, Consistency, and Cinematic Quality
Abstract
Despite recent progress in video generative models, existing state-of-the-art methods can only produce videos lasting 5-16 seconds, which are nonetheless often labeled "long-form videos". Videos exceeding 16 seconds struggle to maintain consistent character appearances and scene layouts throughout the narrative; multi-subject long videos in particular fail to preserve character consistency and motion coherence. While some methods can generate videos up to 150 seconds long, they often suffer from frame redundancy and low temporal diversity. Recent work has attempted to produce long-form videos featuring multiple characters, narrative coherence, and high-fidelity detail. We analyze 32 papers on long-video generation to identify the key architectural components and training strategies that consistently yield these qualities. We also construct a comprehensive novel taxonomy of existing methods and present comparative tables that categorize papers by their architectural designs and performance characteristics.