MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation

Petru-Daniel Tudosiu, Yongxin Yang, Shifeng Zhang, Fei Chen, Steven McDonagh, Gerasimos Lampouras, Ignacio Iacobacci, Sarah Parisot; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 22413-22422

Abstract


Text-to-image generation has achieved astonishing results yet precise spatial controllability and prompt fidelity remain highly challenging. This limitation is typically addressed through cumbersome prompt engineering scene layout conditioning or image editing techniques which often require hand drawn masks. Nonetheless pre-existing works struggle to take advantage of the natural instance-level compositionality of scenes due to the typically flat nature of rasterized RGB output images. Towards addressing this challenge we introduce MuLAn: a novel dataset comprising over 44K MUlti-Layer ANnotations of RGB images as multi-layer instance-wise RGBA decompositions and over 100K instance images. To build MuLAn we developed a training free pipeline which decomposes a monocular RGB image into a stack of RGBA layers comprising of background and isolated instances. We achieve this through the use of pretrained general-purpose models and by developing three modules: image decomposition for instance discovery and extraction instance completion to reconstruct occluded areas and image re-assembly. We use our pipeline to create MuLAn-COCO and MuLAn-LAION datasets which contain a variety of image decompositions in terms of style composition and complexity. With MuLAn we provide the first photorealistic resource providing instance decomposition and occlusion information for high quality images opening up new avenues for text-to-image generative AI research. With this we aim to encourage the development of novel generation and editing technology in particular layer-wise solutions. MuLAn data resources are available at https://MuLAn-dataset.github.io/

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Tudosiu_2024_CVPR, author = {Tudosiu, Petru-Daniel and Yang, Yongxin and Zhang, Shifeng and Chen, Fei and McDonagh, Steven and Lampouras, Gerasimos and Iacobacci, Ignacio and Parisot, Sarah}, title = {MULAN: A Multi Layer Annotated Dataset for Controllable Text-to-Image Generation}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {22413-22422} }