WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation

Chow, Wei; Pan, Jiachun; Liang, Yongyuan; Zhou, Mingze; Song, Xue; Jia, Liyu; Zhang, Saining; Tang, Siliang; Li, Juncheng; Zhang, Fengda; Wu, Weijia; Zhang, Hanwang; Chua, Tat-Seng

Wei Chow, Jiachun Pan, Yongyuan Liang, Mingze Zhou, Xue Song, Liyu Jia, Saining Zhang, Siliang Tang, Juncheng Li, Fengda Zhang, Weijia Wu, Hanwang Zhang, Tat-Seng Chua; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 15343-15353

Abstract

Recent unified multimodal models (UMMs) have achieved remarkable progress in visual comprehension and generation. However, existing datasets and benchmarks focus predominantly on single-turn interactions, overlooking the multi-turn, context-dependent nature of real-world image creation and editing. To bridge this gap, we introduce WEAVE, the first comprehensive suite for in-context interleaved cross-modality comprehension and generation, comprising two complementary components. WEAVE-100k is a large-scale dataset of 100,000 interleaved samples spanning over 370,000 dialogue turns and 500,000 images, encompassing comprehension, editing, and generation tasks that demand reasoning over prior context. WEAVE-Bench is a human-annotated benchmark of 100 items with 480 images, equipped with a hybrid VLM judge evaluation framework that jointly leverages reference images and original-image-instruction pairs to assess multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains. Experiments show that training on WEAVE-100k substantially improves vision comprehension, image editing, and comprehension-generation collaboration, while further enabling the emergence of visual-memory capabilities in UMMs. Extensive evaluations on WEAVE-Bench reveal persistent limitations of current approaches in multi-turn, context-aware image generation and editing. We hope WEAVE provides both a perspective and a foundation for advancing in-context interleaved comprehension and generation in the multimodal community.

Related Material

[bibtex]

@InProceedings{Chow_2026_CVPR,
    author    = {Chow, Wei and Pan, Jiachun and Liang, Yongyuan and Zhou, Mingze and Song, Xue and Jia, Liyu and Zhang, Saining and Tang, Siliang and Li, Juncheng and Zhang, Fengda and Wu, Weijia and Zhang, Hanwang and Chua, Tat-Seng},
    title     = {WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {15343-15353}
}