@InProceedings{Mushtaq_2025_CVPR,
  author    = {Mushtaq, Erum and Fabian, Zalan and Bakman, Yavuz Faruk and Ramakrishna, Anil and Soltanolkotabi, Mahdi and Avestimehr, Salman},
  title     = {HARMONY: Hidden Activation Representations and Model Output-Aware Uncertainty Estimation for Vision-Language Models},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops},
  month     = {June},
  year      = {2025},
  pages     = {1663-1668}
}
HARMONY: Hidden Activation Representations and Model Output-Aware Uncertainty Estimation for Vision-Language Models
Abstract
Assessing the reliability of Vision-Language Models (VLMs) is crucial in high-stakes applications, and Uncertainty Estimation (UE) methods are widely used for this purpose. Most existing probability-based UE approaches rely on output probability distributions, aggregating token probabilities into a single uncertainty score with predefined functions. Another line of research leverages model hidden representations, training MLP-based models to predict uncertainty. However, these methods often fall short in capturing the complex semantic and visual relationships between tokens, and they struggle to identify biased probabilities influenced by language priors. Based on these observations, we propose HARMONY (Hidden Activation Representations and Model Output-aware uNcertaintY Estimation for Vision-Language Models), a transformer-based UE function that jointly leverages model hidden representations and output token probabilities. Our key hypothesis is that both the model's internal beliefs about its visual understanding and the model's output carry reliability signals, and that leveraging both simultaneously provides a better uncertainty estimate. Experimental results on two benchmark open-ended VQA datasets (OK-VQA and A-OKVQA) and three state-of-the-art VLMs demonstrate that our method consistently outperforms existing approaches, achieving up to 4% improvement in AUROC and 6% in PRR.
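The core idea, per the abstract, is to fuse per-token hidden activations with per-token output probabilities and score them with a learned transformer-style function. The sketch below is not the authors' released implementation; it is a minimal NumPy illustration of that fusion pattern, with a single self-attention layer and untrained, randomly initialized weights standing in for learned parameters. The function name `harmony_style_score` and all dimensions are assumptions for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)


def harmony_style_score(hidden, probs, d=16, seed=0):
    """Illustrative fusion of hidden states and token probabilities.

    hidden: (T, H) array of per-token hidden activations.
    probs:  (T,)  array of per-token output probabilities.
    Returns a scalar in (0, 1) interpreted as an uncertainty score.
    NOTE: weights are random stand-ins; a real system would train them.
    """
    rng = np.random.default_rng(seed)
    T, H = hidden.shape
    # Fuse the two signals: concatenate each token's hidden state
    # with its output probability as one feature vector.
    x = np.concatenate([hidden, probs[:, None]], axis=1)  # (T, H + 1)
    # One self-attention layer over the fused token sequence.
    Wq, Wk, Wv = (rng.standard_normal((H + 1, d)) / np.sqrt(H + 1)
                  for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    att = softmax(q @ k.T / np.sqrt(d))                   # (T, T)
    z = att @ v                                           # (T, d)
    # Mean-pool over tokens and map to a scalar score.
    w = rng.standard_normal(d) / np.sqrt(d)
    return float(1.0 / (1.0 + np.exp(-(z.mean(axis=0) @ w))))
```

A trained version would fit the attention and readout weights so that higher scores track wrong answers, evaluated with ranking metrics such as AUROC and PRR as in the paper.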