Loom: Diffusion-Transformer for Interleaved Generation

Mingcheng Ye, Jiaming Liu, Yiren Song; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings, 2026, pp. 4582-4592

Abstract


Interleaved text-image generation aims to jointly produce coherent visual frames and aligned textual descriptions within a single sequence, enabling tasks such as style transfer, compositional synthesis, and procedural tutorials. We present Loom, a unified diffusion-transformer framework for interleaved text-image generation. Loom extends the Bagel unified model via full-parameter fine-tuning and an interleaved architecture that alternates textual and visual embeddings for multi-condition reasoning and sequential planning. A language-planning module first decomposes a user instruction into stepwise prompts and frame embeddings, which guide temporally consistent synthesis. For each frame, Loom conditions on a small set of sampled prior frames together with the global textual context, rather than concatenating all history, yielding controllable and efficient long-horizon generation. Across style transfer, compositional generation, and tutorial-like procedures, Loom delivers superior compositionality, temporal coherence, and text-image alignment. Loom outperforms Bagel by over 50% (text-to-interleaved) and 44% (image-to-interleaved), surpasses Anole by over 51% in the text setting. We also curate a 50K interleaved tutorial dataset and demonstrate strong improvements over unified and diffusion editing baselines.

Related Material


[pdf] [arXiv]
[bibtex]
@InProceedings{Ye_2026_CVPR, author = {Ye, Mingcheng and Liu, Jiaming and Song, Yiren}, title = {Loom: Diffusion-Transformer for Interleaved Generation}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Findings}, month = {June}, year = {2026}, pages = {4582-4592} }