SEED-Story: Multimodal Long Story Generation with Large Language Model

Yang, Shuai; Ge, Yuying; Li, Yang; Chen, Yukang; Ge, Yixiao; Shan, Ying; Chen, Ying-Cong

Shuai Yang, Yuying Ge, Yang Li, Yukang Chen, Yixiao Ge, Ying Shan, Ying-Cong Chen; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025, pp. 1850-1860

Abstract

Advances in image generation and open-form text generation have paved the way for tackling the challenging task of multimodal long story generation. In our work, we introduce SEED-Story, a novel approach that extends Multimodal Large Language Models (MLLMs) to generate coherent, extended narratives composed of both interleaved text and images. By leveraging robust MLLMs, our model predicts text tokens and regresses visual features that are subsequently refined through an adapted de-tokenizer, ensuring that generated images consistently depict recurring characters and maintain a unified visual style. Furthermore, we introduce a multimodal attention sink mechanism to overcome the train-short test-long challenge. This mechanism retains recent tokens while preserving critical tokens from both the start and end of image sequences, enabling efficient autoregressive generation of long stories that can extend to 25 sequences, even though training is performed on only 10 sequences. To support our research, we also introduce StoryStream, a large-scale, high-resolution dataset tailored for multimodal long story generation. StoryStream offers longer narrative sequences and richer visual details than previous datasets, providing a robust benchmark for evaluating image style consistency, story engagement, and image-text coherence. Experimental results demonstrate that SEED-Story produces rich narrative plots and diverse visual scenarios across extended multimodal sequences.
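As an illustration of the token-retention idea described in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes the cache is addressed by integer token positions and that image token spans are known as half-open `(start, end)` ranges. The function name `multimodal_sink_keep_indices` and the parameters `num_image_edge` and `window` are hypothetical names chosen for this example; the abstract does not state how many tokens are kept at each image boundary or how large the recent window is.

```python
from typing import List, Tuple


def multimodal_sink_keep_indices(
    seq_len: int,
    image_spans: List[Tuple[int, int]],
    num_image_edge: int = 2,   # assumed value, not given in the abstract
    window: int = 8,           # assumed value, not given in the abstract
) -> List[int]:
    """Return the sorted cache positions to retain.

    Keeps (a) the first and last `num_image_edge` tokens of every image
    span and (b) the most recent `window` tokens of the sequence.
    All other positions would be evicted from the cache.
    """
    keep = set()

    # Tokens at the start and end of each image sequence.
    for start, end in image_spans:
        edge = min(num_image_edge, end - start)
        keep.update(range(start, start + edge))
        keep.update(range(end - edge, end))

    # The most recent tokens (sliding window at the tail of the sequence).
    keep.update(range(max(0, seq_len - window), seq_len))

    return sorted(keep)


if __name__ == "__main__":
    # One image occupying positions 2..7 in a 20-token sequence.
    print(multimodal_sink_keep_indices(20, [(2, 8)], num_image_edge=2, window=4))
```

In this sketch, the retained set stays small as the story grows, because each image contributes only a fixed number of boundary tokens while the recent window has constant size.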

Related Material

@InProceedings{Yang_2025_ICCV,
    author    = {Yang, Shuai and Ge, Yuying and Li, Yang and Chen, Yukang and Ge, Yixiao and Shan, Ying and Chen, Ying-Cong},
    title     = {SEED-Story: Multimodal Long Story Generation with Large Language Model},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {1850-1860}
}