[pdf] [supp] [arXiv] [bibtex]

@InProceedings{Zhang_2025_CVPR,
  author    = {Zhang, Qihui and Ning, Munan and Liu, Zheyuan and Huang, Yue and Yang, Shuo and Wang, Yanbo and Ye, Jiayi and Chen, Xiao and Song, Yibing and Yuan, Li},
  title     = {UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2025},
  pages     = {9165-9174}
}
UPME: An Unsupervised Peer Review Framework for Multimodal Large Language Model Evaluation
Abstract
Multimodal Large Language Models (MLLMs) have emerged to tackle the challenges of Visual Question Answering (VQA), sparking a new research focus on objective evaluation of these models. Existing evaluation mechanisms are limited by the significant human workload required to design Q&A pairs for visual images, which inherently restricts the scale and scope of evaluation. Although automated MLLM-as-judge approaches attempt to reduce this human workload through mutual model evaluation, they often introduce biases. To address these problems, we propose UPME, an Unsupervised Peer review MLLM Evaluation framework. UPME uses only image data, allowing models to automatically generate questions and conduct peer-review assessments of other models' answers, substantially reducing the reliance on human effort. Additionally, we introduce a vision-language scoring system to mitigate bias, which focuses on three aspects: (i) response correctness; (ii) the model's capability in visual understanding and reasoning; and (iii) the relevance of text-image matching. Experimental results demonstrate that UPME achieves a Pearson correlation of 0.944 with human evaluations on the MMStar dataset and 0.814 on the ScienceQA dataset, indicating that our UPME framework closely aligns with human-designed QA benchmarks and inherent human preferences.
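To make the peer-review loop described in the abstract concrete, the sketch below shows one possible image-only evaluation round in Python. It is a minimal illustration, not the paper's implementation: the `MLLM` callable type, the judging prompts, the random choice of reviewer, and the (0.5, 0.3, 0.2) aspect weights are all assumptions made for this example.

```python
# Minimal sketch of an image-only peer-review evaluation round in the spirit
# of the UPME description above. All names, prompts, and weights here are
# illustrative assumptions, not the paper's actual implementation.
import random
from typing import Callable, Dict, List

# An "MLLM" is modeled as any callable mapping (image_path, prompt) -> text.
MLLM = Callable[[str, str], str]

def peer_review_round(models: Dict[str, MLLM], images: List[str],
                      weights=(0.5, 0.3, 0.2)) -> Dict[str, float]:
    """One unsupervised round: a sampled reviewer writes a question for each
    image, the remaining models answer, and the reviewer scores each answer
    on three aspects (correctness, visual reasoning, text-image relevance)."""
    scores: Dict[str, List[float]] = {name: [] for name in models}
    for image in images:
        reviewer_name = random.choice(list(models))
        reviewer = models[reviewer_name]
        question = reviewer(
            image, "Ask one question that tests understanding of this image.")
        for name, candidate in models.items():
            if name == reviewer_name:
                continue  # a model never reviews its own answer
            answer = candidate(image, question)
            # Hypothetical judging prompt: the reviewer returns three 0-1 scores.
            verdict = reviewer(
                image,
                f"Question: {question}\nAnswer: {answer}\n"
                "Rate correctness, visual reasoning, and text-image relevance, "
                "each from 0 to 1, separated by spaces.")
            try:
                c, v, r = (float(x) for x in verdict.split()[:3])
            except ValueError:
                c = v = r = 0.0  # an unparseable verdict counts as zero
            scores[name].append(
                weights[0] * c + weights[1] * v + weights[2] * r)
    return {name: sum(s) / len(s) if s else 0.0 for name, s in scores.items()}
```

In a real setup the scores would be aggregated over many images and reviewer assignments before being correlated with human judgments, as in the Pearson correlations reported above.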