@InProceedings{Ju_2026_CVPR,
    author    = {Ju, Yiming and Hu, Jijin and Luo, Zhengxiong and Deng, Haoge and Zhao, Hanyu and Du, Li and Xiao, Wenbo and Wu, Chengwei and Hao, Donglin and Wang, Xinlong and Pan, Tengfei},
    title     = {CI-VID: A Coherent Interleaved Text-Video Dataset},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {25568-25577}
}
CI-VID: A Coherent Interleaved Text-Video Dataset
Abstract
Text-to-video (T2V) generation has recently attracted considerable attention, resulting in the development of numerous high-quality datasets that have propelled progress in this area. However, existing public datasets are primarily composed of isolated text-video (T-V) pairs and thus fail to model inter-clip relationships. To address this limitation, we introduce CI-VID, a dataset that moves beyond isolated T2V generation toward text-and-video-to-video (T&V2V) generation. CI-VID contains over 340,000 samples, each comprising a semantically coherent video sequence with interleaved text captions that capture both clip-level content and inter-clip relationships. To validate its effectiveness, we design a comprehensive, multi-dimensional benchmark incorporating human evaluation, VLM-based assessment, and similarity-based metrics. Experimental results demonstrate that models trained on CI-VID significantly improve both accuracy and content consistency in multi-clip video generation. This enables the creation of story-driven content with smooth transitions and strong semantic coherence.