Improving Large Vision and Language Models by Learning from a Panel of Peers

Jefferson Hernandez, Jing Shi, Simon Jenni, Vicente Ordonez, Kushal Kafle; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 1402-1412

Abstract


Traditional alignment methods for Large Vision and Language Models (LVLMs) primarily rely on human-curated preference data. Human-generated preference data is costly; machine-generated preference data is limited in quality; and self-supervised preference data often introduces hallucinations. To overcome these limitations, we propose a novel Panel-of-Peers learning framework inspired by collaborative learning among humans. This approach leverages a panel of LVLMs, each evaluating and learning from their collective outputs through an iterative self-improvement process. By simulating a peer review system, our models generate, assess, and refine outputs in response to a curated set of prompts, mimicking a classroom learning environment. We demonstrate that this methodology enhances model performance without requiring extensive human-labeled datasets. Our experiments show significant improvement across multiple benchmarks, demonstrating the potential of peer evaluations as a scalable alternative to self-supervised alignment. Notably, we show that Panel-of-Peers increases the average score on fifteen benchmarks from 48% to 57%.

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Hernandez_2025_ICCV, author = {Hernandez, Jefferson and Shi, Jing and Jenni, Simon and Ordonez, Vicente and Kafle, Kushal}, title = {Improving Large Vision and Language Models by Learning from a Panel of Peers}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, month = {October}, year = {2025}, pages = {1402-1412} }