Describe, Don't Dictate: Semantic Image Editing with Natural Language Intent

Ci, En; Guan, Shanyan; Ge, Yanhao; Zhang, Yilin; Li, Wei; Zhang, Zhenyu; Yang, Jian; Tai, Ying

En Ci, Shanyan Guan, Yanhao Ge, Yilin Zhang, Wei Li, Zhenyu Zhang, Jian Yang, Ying Tai; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 19185-19194

Abstract

Despite progress in text-to-image generation, semantic image editing remains challenging. Inversion-based algorithms unavoidably introduce reconstruction errors, while instruction-based models are limited mainly by the quality and scale of their training datasets. To address these problems, we propose DescriptiveEdit, an editing framework based on descriptive prompts. The core idea is to reframe "instruction-based image editing" as "reference-image-based text-to-image generation", which preserves the generative power of well-trained text-to-image models without architectural modifications or inversion. Specifically, taking the reference image and a prompt as input, we introduce a Cross-Attentive UNet, which adds attention bridges that inject reference image features into the prompt-to-edit-image generation process. Owing to its text-to-image nature, DescriptiveEdit overcomes the limitations of instruction dataset quality, integrates seamlessly with ControlNet, IP-Adapter, and other extensions, and is more scalable. Experiments on the Emu Edit benchmark show that it improves editing accuracy and consistency.
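
The attention-bridge idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `attention_bridge`, the projection matrices, the residual form, and the `scale` parameter are all assumptions made for illustration, and a single-head attention in plain NumPy stands in for the actual UNet layers. It only shows the general pattern of features from the generation branch querying features from a reference image.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_bridge(gen_feats, ref_feats, w_q, w_k, w_v, scale=1.0):
    """Hypothetical bridge: generation tokens attend to reference tokens.

    gen_feats: (n_gen, d) features from the generation branch (queries).
    ref_feats: (n_ref, d) features from the reference image (keys, values).
    scale:     assumed knob controlling how strongly the reference is injected.
    """
    q = gen_feats @ w_q
    k = ref_feats @ w_k
    v = ref_feats @ w_v
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    # Residual injection: with scale=0 the generation features are unchanged.
    return gen_feats + scale * (weights @ v)

rng = np.random.default_rng(0)
d = 8
gen = rng.normal(size=(16, d))   # toy generation-branch tokens
ref = rng.normal(size=(16, d))   # toy reference-image tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

out = attention_bridge(gen, ref, w_q, w_k, w_v)
print(out.shape)
```

In this toy version, setting `scale=0.0` returns the generation features untouched, which is one simple way to picture how a pretrained text-to-image pathway could be left intact while reference information is added on top.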

Related Material

[bibtex]

@InProceedings{Ci_2025_ICCV,
    author    = {Ci, En and Guan, Shanyan and Ge, Yanhao and Zhang, Yilin and Li, Wei and Zhang, Zhenyu and Yang, Jian and Tai, Ying},
    title     = {Describe, Don't Dictate: Semantic Image Editing with Natural Language Intent},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {19185-19194}
}