Interleaved Vision-and-Language Generation via Generative Voken
Abstract
Multimodal Large Language Models (MLLMs) have demonstrated profound capabilities in multimodal understanding. However, generating images together with coherent text remains underdeveloped. To address this, we introduce a novel interleaved vision-and-language generation method centered on the concept of "generative vokens", which serve as the pivotal elements linking coherent image and text outputs. Our method features a unique two-stage training strategy for description-free multimodal generation, which does not require detailed image descriptions. We integrate classifier-free guidance to strengthen the alignment between generated images and text, yielding more seamless and contextually relevant multimodal interactions. Our model, MiniGPT-5, exhibits substantial improvements over baseline models on multimodal generation datasets, including MMDialog and VIST. In human evaluation, MiniGPT-5 is preferred over the baseline model in more than 56% of multimodal generation cases, highlighting its efficacy across diverse benchmarks.
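The abstract describes the pipeline only at a high level, so the following is a minimal sketch, assuming a typical design in which the LLM's hidden states at the generative-voken positions are projected into the conditioning space of a text-to-image diffusion decoder and images are sampled with standard classifier-free guidance. The class name, dimensions, and guidance scale below are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class VokenMapper(nn.Module):
    """Hypothetical sketch: project LLM hidden states taken at the
    generative-voken positions into the conditioning space of a
    text-to-image diffusion decoder."""

    def __init__(self, llm_dim: int = 4096, cond_dim: int = 768):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, cond_dim),
            nn.GELU(),
            nn.Linear(cond_dim, cond_dim),
        )

    def forward(self, voken_hidden: torch.Tensor) -> torch.Tensor:
        # voken_hidden: (batch, n_vokens, llm_dim) -> (batch, n_vokens, cond_dim)
        return self.proj(voken_hidden)


def classifier_free_guidance(eps_cond: torch.Tensor,
                             eps_uncond: torch.Tensor,
                             scale: float = 7.5) -> torch.Tensor:
    """Standard classifier-free guidance mix (Ho & Salimans, 2022):
    scale > 1 pushes samples toward the (voken-derived) conditioning."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


# Toy usage with fake tensors in place of real model outputs.
mapper = VokenMapper()
hidden = torch.randn(2, 8, 4096)   # hidden states at 8 voken positions
cond = mapper(hidden)              # (2, 8, 768) decoder conditioning
eps_c, eps_u = torch.randn(2, 4, 64, 64), torch.randn(2, 4, 64, 64)
eps = classifier_free_guidance(eps_c, eps_u)
```

The two pieces are independent: the mapper bridges the LLM and the image decoder, while the guidance mix is applied at each denoising step of sampling.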
Related Material

[pdf] [supp] [arXiv] [bibtex]

@InProceedings{Zheng_2026_WACV,
    author    = {Zheng, Kaizhi and He, Xuehai and Wang, Xin Eric},
    title     = {Interleaved Vision-and-Language Generation via Generative Voken},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {March},
    year      = {2026},
    pages     = {472-482}
}