Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation

Qin Guo, Tianwei Lin; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 6986-6996

Abstract


Recently, diffusion-based methods such as InstructPix2Pix (IP2P) have achieved effective instruction-based image editing that requires only natural-language instructions from the user. However, these methods often inadvertently alter unintended areas and struggle with multi-instruction editing, resulting in compromised outcomes. To address these issues, we introduce Focus on Your Instruction (FoI), a method designed to ensure precise and harmonious editing across multiple instructions without extra training or test-time optimization. In FoI, we primarily emphasize two aspects: (1) precisely extracting the region of interest for each instruction and (2) guiding the denoising process to concentrate within these regions of interest. For the first objective, we identify the implicit grounding capability of IP2P from the cross-attention between instruction and image, and then develop an effective mask extraction method.
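The abstract does not spell out how a mask is derived from the cross-attention. The snippet below is only a generic sketch of one common way to turn cross-attention into a region-of-interest mask: average the attention over heads and over selected instruction tokens, min-max normalize, and threshold. It is not the FoI implementation. The function name `extract_instruction_mask`, the `attn[h][p][t]` layout, the `token_ids` selection, and the default 0.5 threshold are all hypothetical choices made for illustration.

```python
from typing import List, Sequence


def extract_instruction_mask(
    attn: Sequence[Sequence[Sequence[float]]],
    token_ids: Sequence[int],
    height: int,
    width: int,
    threshold: float = 0.5,
) -> List[List[int]]:
    """Hypothetical sketch: cross-attention probabilities -> binary mask.

    attn[h][p][t] is the attention that image position p (row-major over a
    height x width latent grid) pays to text token t in head h.
    token_ids picks the instruction tokens naming the edit target; this
    selection is an assumption, not the paper's procedure.
    """
    num_pixels = height * width
    num_heads = len(attn)
    if num_heads == 0 or not token_ids:
        raise ValueError("need at least one head and one token")
    if any(len(head) != num_pixels for head in attn):
        raise ValueError("each head must have height * width query rows")

    # 1) Average over heads and over the chosen instruction tokens.
    scores = []
    for p in range(num_pixels):
        total = 0.0
        for h in range(num_heads):
            total += sum(attn[h][p][t] for t in token_ids)
        scores.append(total / (num_heads * len(token_ids)))

    # 2) Min-max normalize so the threshold does not depend on scale.
    lo, hi = min(scores), max(scores)
    span = hi - lo
    if span == 0:
        # Uniform attention carries no localization: return an empty mask.
        normed = [0.0] * num_pixels
    else:
        normed = [(s - lo) / span for s in scores]

    # 3) Threshold and reshape the flat scores into a height x width grid.
    flat = [1 if v >= threshold else 0 for v in normed]
    return [flat[r * width:(r + 1) * width] for r in range(height)]
```

For example, with one head on a 2 x 2 grid where token 1 receives attention `[0.1, 0.7, 0.05, 0.8]` at the four positions, `extract_instruction_mask(attn, [1], 2, 2)` keeps the two right-hand positions, giving `[[0, 1], [0, 1]]`. Min-max normalization is used here so that a single threshold can be reused across instructions whose raw attention magnitudes differ.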

Related Material


@InProceedings{Guo_2024_CVPR,
    author    = {Guo, Qin and Lin, Tianwei},
    title     = {Focus on Your Instruction: Fine-grained and Multi-instruction Image Editing by Attention Modulation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {6986-6996}
}