iEdit: Localised Text-guided Image Editing with Weak Supervision

Rumeysa Bodur, Erhan Gundogdu, Binod Bhattarai, Tae-Kyun Kim, Michael Donoser, Loris Bazzani; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 7426-7435

Abstract


Diffusion models can generate realistic images with text guidance using large-scale datasets. However, they demonstrate limited controllability over the generated images. We introduce iEdit, a novel method for text-guided image editing conditioned on a source image and a textual edit prompt. As a fully annotated dataset with target images does not exist, previous approaches perform subject-specific fine-tuning at test time or adopt contrastive learning without a target image, leading to issues in preserving source-image fidelity. We propose to automatically construct a dataset, derived from LAION-5B, containing pseudo-target images and descriptive edit prompts. This dataset allows us to incorporate a weakly supervised loss function that generates the pseudo-target image from the source image's latent noise, conditioned on the edit prompt. To encourage localised editing, we propose a loss function that uses segmentation masks to guide the edits during training and, optionally, at inference. Trained with limited GPU resources on the constructed dataset, our model outperforms its counterparts in image fidelity, CLIP alignment score, and qualitative evaluation on both generated and real images.
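To make the abstract's training objective concrete, below is a minimal sketch (not the authors' released code) of a mask-weighted, weakly supervised diffusion training step, assuming a Stable-Diffusion-style latent model with diffusers-like APIs (AutoencoderKL, UNet2DConditionModel, a DDPM-style scheduler). The names lambda_mask, edit_mask, and pseudo_target_img are illustrative placeholders; the exact loss formulation in the paper may differ.

    import torch
    import torch.nn.functional as F

    def training_step(unet, vae, text_encoder, scheduler,
                      source_img, pseudo_target_img, edit_prompt_ids,
                      edit_mask, lambda_mask=1.0):
        # Encode source and pseudo-target images into the VAE latent space
        # (scaling factors omitted for brevity).
        source_latents = vae.encode(source_img).latent_dist.sample()
        target_latents = vae.encode(pseudo_target_img).latent_dist.sample()

        # Diffuse the *source* latents: editing starts from the source image,
        # not from pure noise.
        noise = torch.randn_like(source_latents)
        t = torch.randint(0, scheduler.config.num_train_timesteps,
                          (source_latents.shape[0],),
                          device=source_latents.device)
        noisy_latents = scheduler.add_noise(source_latents, noise, t)

        # Predict noise conditioned on the edit prompt.
        text_emb = text_encoder(edit_prompt_ids)[0]
        noise_pred = unet(noisy_latents, t,
                          encoder_hidden_states=text_emb).sample

        # Recover the predicted clean latent x0_hat from the noise estimate.
        alpha_bar = scheduler.alphas_cumprod.to(t.device)[t].view(-1, 1, 1, 1)
        x0_hat = (noisy_latents - torch.sqrt(1.0 - alpha_bar) * noise_pred) \
                 / torch.sqrt(alpha_bar)

        # Weak supervision: match the pseudo-target latents...
        loss_map = F.mse_loss(x0_hat, target_latents, reduction="none")

        # ...up-weighting the segmentation-masked region so edits stay local.
        mask = F.interpolate(edit_mask, size=loss_map.shape[-2:],
                             mode="nearest")
        loss = ((1.0 + lambda_mask * mask) * loss_map).mean()
        return loss

The key design choice the sketch illustrates is that the noised latent comes from the source image while the reconstruction target is the pseudo-target, so the model learns to move from source to edited image under the edit prompt, with the mask term penalising changes outside the edit region less than missed changes inside it.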

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Bodur_2024_CVPR,
    author    = {Bodur, Rumeysa and Gundogdu, Erhan and Bhattarai, Binod and Kim, Tae-Kyun and Donoser, Michael and Bazzani, Loris},
    title     = {iEdit: Localised Text-guided Image Editing with Weak Supervision},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {7426-7435}
}