InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning

Jing Shi, Wei Xiong, Zhe Lin, Hyun Joon Jung; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 8543-8552

Abstract


Recent advances in personalized image generation have enabled pre-trained text-to-image models to learn new concepts from specific image sets. However, these methods often necessitate extensive test-time finetuning for each new concept, leading to inefficiencies in both time and scalability. To address this challenge, we introduce InstantBooth, an innovative approach leveraging existing text-to-image models for instantaneous text-guided image personalization, eliminating the need for test-time finetuning. This efficiency is achieved through two primary innovations. Firstly, we utilize an image encoder that transforms input images into a global embedding to grasp the general concept. Secondly, we integrate new adapter layers into the pre-trained model, enhancing its ability to capture intricate identity details while maintaining language coherence. Significantly, our model is trained exclusively on text-image pairs, without reliance on concept-specific paired images. When benchmarked against existing finetuning-based personalization techniques like DreamBooth and Textual-Inversion, InstantBooth not only shows comparable proficiency in aligning language with image, maintaining image quality, and preserving identity, but also boasts a 100-fold increase in processing speed.
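
The two components named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names are hypothetical, mean pooling stands in for a learned image encoder, and the tanh-gated residual is an assumed way for an adapter to add identity detail without disturbing the frozen pre-trained layer. The abstract does not specify any of these details.

```python
import math


def global_concept_embedding(patch_features):
    """Mean-pool per-patch feature vectors into a single global vector.

    Assumption: a real image encoder is learned; mean pooling only
    illustrates the idea of one embedding summarizing the concept.
    """
    count = len(patch_features)
    dim = len(patch_features[0])
    return [sum(patch[i] for patch in patch_features) / count for i in range(dim)]


def _softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gated_adapter(hidden_states, patch_features, gate):
    """Hypothetical adapter layer: each hidden token attends over the
    image patch features, and the attended vector is added back through
    a tanh(gate) scale. With gate == 0.0 the output equals the input,
    so the pre-trained behavior is untouched at initialization.
    """
    scale = math.tanh(gate)
    dim = len(hidden_states[0])
    output = []
    for token in hidden_states:
        # Scaled dot-product attention weights over the patches.
        scores = [
            sum(t * p for t, p in zip(token, patch)) / math.sqrt(dim)
            for patch in patch_features
        ]
        weights = _softmax(scores)
        attended = [
            sum(w * patch[i] for w, patch in zip(weights, patch_features))
            for i in range(dim)
        ]
        # Residual connection scaled by the gate.
        output.append([t + scale * a for t, a in zip(token, attended)])
    return output
```

Under these assumptions, a zero gate leaves the pre-trained hidden states unchanged, which is one common way to add new layers to a frozen model while keeping its original language behavior as the starting point.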

Related Material


[bibtex]
@InProceedings{Shi_2024_CVPR,
    author    = {Shi, Jing and Xiong, Wei and Lin, Zhe and Jung, Hyun Joon},
    title     = {InstantBooth: Personalized Text-to-Image Generation without Test-Time Finetuning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {8543-8552}
}