RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions

Bimsara Pathiraja, Maitreya Patel, Shivam Singh, Yezhou Yang, Chitta Baral; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 15646-15656

Abstract


Despite recent advances in inversion- and instruction-based image editing, existing approaches primarily excel at editing single, prominent objects but struggle significantly when applied to complex scenes containing multiple entities. To quantify this gap, we first introduce **`RefEdit-Bench`**, a rigorous real-world benchmark rooted in RefCOCO, where even baselines trained on millions of samples perform poorly. To overcome this limitation, we introduce **`RefEdit`** -- an instruction-based editing model trained on our scalable synthetic data generation pipeline. Our **`RefEdit`**, trained on only 20,000 editing triplets, outperforms the Flux/SD3-based baselines trained on millions of samples. Extensive evaluations across various benchmarks demonstrate that our model not only excels in referring-expression tasks but also enhances performance on traditional benchmarks, achieving state-of-the-art results comparable to closed-source methods. We will release our code, data, and checkpoints.

Related Material


@InProceedings{Pathiraja_2025_ICCV,
    author    = {Pathiraja, Bimsara and Patel, Maitreya and Singh, Shivam and Yang, Yezhou and Baral, Chitta},
    title     = {RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {15646-15656}
}