Lumina-Image 2.0: A Unified and Efficient Image Generative Framework

Qin, Qi; Zhuo, Le; Xin, Yi; Du, Ruoyi; Li, Zhen; Fu, Bin; Lu, Yiting; Li, Xinyue; Liu, Dongyang; Zhu, Xiangyang; Beddow, Will; Millon, Erwann; Perez, Victor; Wang, Wenhai; Qiao, Yu; Zhang, Bo; Liu, Xiaohong; Li, Hongsheng; Xu, Chang; Gao, Peng

Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 20031-20042

Abstract

We introduce Lumina-Image 2.0, an advanced text-to-image (T2I) model that surpasses previous state-of-the-art methods across multiple benchmarks. Lumina-Image 2.0 is characterized by two key features. (1) Unification: it adopts a unified architecture (Unified Next-DiT) that treats text and image tokens as a joint sequence, enabling natural cross-modal interactions and allowing seamless task expansion. In addition, because high-quality captioners can provide semantically well-aligned text-image training pairs, we introduce a unified captioning system, Unified Captioner (UniCap), which generates detailed and accurate multilingual captions for our model. This not only accelerates model convergence but also enhances prompt adherence, multi-granularity prompt handling, and task expansion with customized prompt templates. (2) Efficiency: we develop multi-stage progressive training strategies to optimize the model, alongside inference-time acceleration strategies that do not compromise image quality. We evaluate our model on academic benchmarks and T2I arenas, with results confirming that it matches or exceeds existing state-of-the-art models across various metrics, highlighting the effectiveness of our methods. We have released our training details, code, and models at https://github.com/Alpha-VLLM/Lumina-Image-2.0.
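The idea of treating text and image tokens as a joint sequence can be illustrated with a minimal sketch. This is not the authors' Unified Next-DiT implementation: it is a single-head self-attention toy in plain Python, and every name in it (`joint_self_attention`, `rand_matrix`, the weight matrices, the token counts) is a hypothetical stand-in chosen for illustration. It omits multi-head attention, positional encoding, normalization, and timestep conditioning, and only shows that once the two token lists are concatenated, every image token can attend to every text token and vice versa.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def joint_self_attention(text_tokens, image_tokens, w_q, w_k, w_v):
    """Single-head self-attention over text and image tokens as one sequence.

    Assumption: both token lists already share the same embedding width.
    """
    seq = text_tokens + image_tokens              # one joint sequence
    q, k, v = matmul(seq, w_q), matmul(seq, w_k), matmul(seq, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]          # transpose keys
    scores = matmul(q, k_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v), weights


rng = random.Random(0)
dim = 4


def rand_matrix(rows, cols):
    """Hypothetical helper: random Gaussian matrix standing in for real embeddings."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


text = rand_matrix(3, dim)    # 3 toy text tokens
image = rand_matrix(5, dim)   # 5 toy image tokens
w_q, w_k, w_v = (rand_matrix(dim, dim) for _ in range(3))

out, weights = joint_self_attention(text, image, w_q, w_k, w_v)
print(len(out), len(out[0]))  # 8 4
```

Each row of `weights` spans all eight positions, so the attention weights of an image token (rows 3 to 7) include entries for the three text tokens (columns 0 to 2). That is the cross-modal interaction a joint sequence provides without a separate cross-attention module.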

Related Material

[bibtex]

@InProceedings{Qin_2025_ICCV,
    author    = {Qin, Qi and Zhuo, Le and Xin, Yi and Du, Ruoyi and Li, Zhen and Fu, Bin and Lu, Yiting and Li, Xinyue and Liu, Dongyang and Zhu, Xiangyang and Beddow, Will and Millon, Erwann and Perez, Victor and Wang, Wenhai and Qiao, Yu and Zhang, Bo and Liu, Xiaohong and Li, Hongsheng and Xu, Chang and Gao, Peng},
    title     = {Lumina-Image 2.0: A Unified and Efficient Image Generative Framework},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {20031-20042}
}