BibTeX
@InProceedings{Agarwal_2025_WACV,
  author    = {Agarwal, Aishwarya and Karanam, Srikrishna and Shukla, Tripti and Srinivasan, Balaji Vasan},
  title     = {An Image is Worth Multiple Words: Multi-Attribute Inversion for Constrained Text-to-Image Synthesis},
  booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
  month     = {February},
  year      = {2025},
  pages     = {6053-6062}
}
An Image is Worth Multiple Words: Multi-Attribute Inversion for Constrained Text-to-Image Synthesis
Abstract
We consider the problem of constraining diffusion model outputs with a user-supplied reference image. Our key objective is to extract multiple attributes (e.g., color, object, layout, style) from this single reference image and then generate new samples and novel compositions with them. We first perform an extensive attribute distribution analysis that leads to the discovery of an extended conditioning space consisting of multiple textual conditions, which vary both per layer of the U-Net and per timestep of the denoising process. We observe that although this extended conditioning space provides greater control over different attributes of the generated image, a subset of these attributes is often captured in the same set of U-Net layers and/or across the same denoising timesteps. For instance, color and style are captured across the same U-Net layers, whereas layout and color are captured across the same timestep stages. Existing works on multi-attribute constrained image generation extend textual inversion by learning per-layer or per-timestep tokens, but they suffer from attribute entanglement for the reasons described above. To address this gap, we introduce our second contribution: a new multi-attribute textual inversion algorithm, MATTE, with associated disentanglement-enhancing regularization losses, which operates jointly across both the layer and timestep dimensions and explicitly yields four disentangled tokens (color, style, layout, and object). We conduct extensive qualitative and quantitative evaluations to demonstrate the effectiveness of the proposed approach.
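To make the high-level description above concrete, below is a minimal, hypothetical PyTorch sketch of how four learnable attribute tokens (color, style, layout, object) could be routed to different U-Net layer groups and denoising timestep stages, together with a simple pairwise decorrelation penalty standing in for the disentanglement-enhancing regularizers mentioned in the abstract. The class name, the routing table, and the cosine-similarity loss are all assumptions made for illustration; they are not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

ATTRIBUTES = ("color", "style", "layout", "object")  # the four attribute tokens

class MultiAttributeTokens(nn.Module):
    """Four learnable attribute embeddings (hypothetical sketch, not MATTE itself)."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.embed_dim = embed_dim
        self.tokens = nn.ParameterDict(
            {name: nn.Parameter(0.02 * torch.randn(embed_dim)) for name in ATTRIBUTES}
        )

    def route(self, layer_group: str, timestep_stage: str) -> torch.Tensor:
        # Hypothetical routing table: which attribute tokens condition a given
        # U-Net layer group at a given denoising stage. The abstract only states
        # that color/style tend to share U-Net layers and layout/color share
        # timestep stages, so this concrete split is purely illustrative.
        active = ["layout", "object"] if layer_group == "coarse" else ["color", "style"]
        if timestep_stage == "late":  # low-noise steps refine appearance, not layout
            active = [a for a in active if a != "layout"]
        if not active:
            return torch.empty(0, self.embed_dim)
        return torch.stack([self.tokens[a] for a in active])

    def decorrelation_loss(self) -> torch.Tensor:
        # Simple stand-in for a disentanglement regularizer: penalize pairwise
        # cosine similarity between the four learned tokens.
        loss = torch.zeros(())
        names = list(ATTRIBUTES)
        for i in range(len(names)):
            for j in range(i + 1, len(names)):
                loss = loss + F.cosine_similarity(
                    self.tokens[names[i]], self.tokens[names[j]], dim=0
                ).abs()
        return loss / 6  # 6 unordered pairs among 4 tokens

# Usage: embeddings that would be appended to the text-encoder output for one
# cross-attention call, plus the regularizer added to the inversion objective.
tokens = MultiAttributeTokens()
cond = tokens.route(layer_group="coarse", timestep_stage="early")
print(cond.shape, tokens.decorrelation_loss().item())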