DiffTell: A High-Quality Dataset for Describing Image Manipulation Changes

Zonglin Di, Jing Shi, Yifei Fan, Hao Tan, Alexander Black, John Collomosse, Yang Liu; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 24580-24590

Abstract


The image difference captioning (IDC) task is to describe the distinctions between two images. However, existing datasets do not offer comprehensive coverage across all image-difference categories. In this work, we introduce DiffTell, a high-quality dataset covering various types of image manipulation, including global image alterations, object-level changes, and text manipulations. Data quality is controlled by careful human filtering. Additionally, to scale up data collection without prohibitive human labor costs, we explore the possibility of automatic filtering for quality control. We demonstrate that both traditional methods and recent multimodal large language models (MLLMs) improve on the IDC task after training on the DiffTell dataset. Through extensive ablation studies, we provide a detailed analysis of the performance gains attributed to DiffTell. Experiments show that DiffTell significantly expands the resources available for IDC research, offering a more comprehensive foundation and benchmark for future investigations.

Related Material


@InProceedings{Di_2025_ICCV,
    author    = {Di, Zonglin and Shi, Jing and Fan, Yifei and Tan, Hao and Black, Alexander and Collomosse, John and Liu, Yang},
    title     = {DiffTell: A High-Quality Dataset for Describing Image Manipulation Changes},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {24580-24590}
}