Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models

Chang Liu, Haoning Wu, Yujie Zhong, Xiaoyun Zhang, Yanfeng Wang, Weidi Xie; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 6190-6200

Abstract


Generative models have recently exhibited exceptional capabilities in text-to-image generation but still struggle to generate image sequences coherently. In this work we focus on a novel yet challenging task of generating a coherent image sequence based on a given storyline denoted as open-ended visual storytelling. We make the following three contributions: (i) to fulfill the task of visual storytelling we propose a learning-based auto-regressive image generation model termed as StoryGen with a novel vision-language context module that enables to generate the current frame by conditioning on the corresponding text prompt and preceding image-caption pairs; (ii) to address the data shortage of visual storytelling we collect paired image-text sequences by sourcing from online videos and open-source E-books establishing processing pipeline for constructing a large-scale dataset with diverse characters storylines and artistic styles named StorySalon; (iii) Quantitative experiments and human evaluations have validated the superiority of our StoryGen where we show it can generalize to unseen characters without any optimization and generate image sequences with coherent content and consistent character. Code dataset and models are available at https://haoningwu3639.github.io/StoryGen_Webpage/

Related Material


[pdf]
[bibtex]
@InProceedings{Liu_2024_CVPR, author = {Liu, Chang and Wu, Haoning and Zhong, Yujie and Zhang, Xiaoyun and Wang, Yanfeng and Xie, Weidi}, title = {Intelligent Grimm - Open-ended Visual Storytelling via Latent Diffusion Models}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {6190-6200} }