Mixture-of-Scores: Robust Image-Text Data Valuation via Three Lines of Code

Sitong Wu, Haoru Tan, Yukang Chen, Shaofeng Zhang, Jingyao Li, Bei Yu, Xiaojuan Qi, Jiaya Jia; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 24603-24614

Abstract


Evaluating the quality of image-text pairs is essential for data processing in vision-language pre-training. Most current metrics, such as CLIP-Score, rely on off-the-shelf models to score pairs based on feature similarity. However, we find that different scoring models often produce inconsistent quality scores for the same data. This disparity affects data processing results, leading to variations in the resulting datasets and, consequently, in the performance of models trained on them. Notably, no single quality score excels across all tasks, as each is biased toward specific concepts, so different scores have complementary effects on model performance. This complicates the selection of scoring models. In this paper, we analyze these disparities and propose a method called Mixture-of-Scores (MoS). This approach integrates various quality scores into a robust ensemble score, effectively mitigating biases. It can be implemented easily in just three lines of code. Our extensive experiments show that MoS outperforms existing single quality scores across multiple vision-language tasks and benchmarks. We aim to offer new insights and practical tools to help the community navigate the challenges of scoring model selection.
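
The abstract does not spell out how MoS combines the individual scores, so the following is only a generic illustration of score ensembling, not the paper's method. It assumes each scoring model returns one number per image-text pair on its own scale, standardizes each model's scores so they are comparable, and averages them per pair. The function names (`zscore`, `ensemble_scores`) and the example values are hypothetical.

```python
from statistics import mean, pstdev


def zscore(values):
    """Standardize one scorer's outputs to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        # A constant scorer carries no ranking information.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def ensemble_scores(score_lists):
    """Average the standardized scores across scorers, one value per pair.

    This is an assumed aggregation rule chosen for illustration only.
    """
    normalized = [zscore(scores) for scores in score_lists]
    return [mean(per_pair) for per_pair in zip(*normalized)]


# Hypothetical scores from two scoring models for four image-text pairs.
scores_a = [0.31, 0.22, 0.28, 0.10]  # e.g. a cosine-similarity style score
scores_b = [12.0, 15.0, 9.0, 3.0]    # e.g. a score on a different scale

combined = ensemble_scores([scores_a, scores_b])

# Keep the two highest-scoring pairs under the combined score.
keep = sorted(range(len(combined)), key=combined.__getitem__, reverse=True)[:2]
```

Standardizing before averaging keeps a scorer with a large numeric range from dominating the combination. Other choices, such as rank-based normalization or weighted averaging, would fit the same interface.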

Related Material


@InProceedings{Wu_2025_ICCV,
    author    = {Wu, Sitong and Tan, Haoru and Chen, Yukang and Zhang, Shaofeng and Li, Jingyao and Yu, Bei and Qi, Xiaojuan and Jia, Jiaya},
    title     = {Mixture-of-Scores: Robust Image-Text Data Valuation via Three Lines of Code},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {24603-24614}
}