UniversalBooth: Model-Agnostic Personalized Text-to-Image Generation
Abstract
Given a source image, personalized text-to-image generation produces images that preserve the subject's identity and appearance while following text prompts. Existing methods rely heavily on test-time optimization to achieve this customization. Although some recent works pursue zero-shot personalization, they still require re-training when applied to different text-to-image diffusion models. In this paper, we instead propose a model-agnostic personalization method termed UniversalBooth. At the heart of our approach lies a novel cross-attention mechanism in which different blocks at the same diffusion scale share common square key and value transformation matrices. In this way, the image encoder is decoupled from the diffusion architecture while maintaining its effectiveness. Moreover, the cross-attention operates hierarchically: holistic attention first captures the global semantics of the user input for textual combination with the editing prompt, and fine-grained attention then distributes the holistic attention scores over local patches to enhance appearance consistency. To improve performance when deployed on unseen diffusion models, we further impose an optimal transport prior on the model and encourage the attention scores allocated by cross-attention to satisfy the optimal transport constraint. Experiments demonstrate that our personalized generation model generalizes to unseen text-to-image diffusion models with a wide spectrum of architectures and functionalities without any additional optimization, whereas other methods cannot. Meanwhile, it achieves zero-shot personalization performance on seen architectures comparable to existing works.
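For illustration only, below is a minimal PyTorch sketch of the shared-projection idea described in the abstract: every cross-attention block at one diffusion scale reuses a single pair of square key/value transformation matrices, so the reference-image encoder outputs features of a fixed dimensionality that is not tied to any particular diffusion backbone. This is not the authors' released implementation; all module and variable names (SharedKVProjection, ModelAgnosticCrossAttention, image_tokens) are hypothetical.

```python
import torch
import torch.nn as nn


class SharedKVProjection(nn.Module):
    """Square key/value transforms shared by every cross-attention block
    operating at the same diffusion scale (illustrative sketch only)."""
    def __init__(self, dim: int):
        super().__init__()
        # Square (dim x dim) matrices keep the feature dimension unchanged,
        # so the image-encoder output is not bound to a specific backbone.
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)


class ModelAgnosticCrossAttention(nn.Module):
    """One cross-attention block; all blocks at a given scale reuse the
    same SharedKVProjection instance."""
    def __init__(self, dim: int, shared_kv: SharedKVProjection):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.shared_kv = shared_kv
        self.scale = dim ** -0.5

    def forward(self, hidden_states: torch.Tensor,
                image_tokens: torch.Tensor) -> torch.Tensor:
        q = self.to_q(hidden_states)           # queries from diffusion features
        k = self.shared_kv.to_k(image_tokens)  # keys from the image encoder
        v = self.shared_kv.to_v(image_tokens)  # values from the image encoder
        attn = torch.softmax(q @ k.transpose(-1, -2) * self.scale, dim=-1)
        return attn @ v


# Usage: several blocks at one scale share a single projection pair.
shared = SharedKVProjection(dim=320)
blocks = [ModelAgnosticCrossAttention(320, shared) for _ in range(3)]
x = torch.randn(1, 64, 320)    # diffusion features at this scale
ref = torch.randn(1, 77, 320)  # encoded reference-image tokens
out = blocks[0](x, ref)        # (1, 64, 320)
```

Because the shared matrices are square, swapping in a different diffusion backbone at the same scale would, under this reading, only require mapping its features into the common dimensionality rather than re-training the image encoder.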
Related Material

[bibtex]
@InProceedings{Liu_2025_ICCV,
  author    = {Liu, Songhua and Yu, Ruonan and Wang, Xinchao},
  title     = {UniversalBooth: Model-Agnostic Personalized Text-to-Image Generation},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2025},
  pages     = {18314-18324}
}