Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering
Abstract
The goal of selective prediction is to allow a model to abstain when it may not be able to deliver a reliable prediction, which is important in safety-critical contexts. Existing approaches to selective prediction typically require access to the internals of a model, require retraining a model, or study only unimodal models. However, the most powerful models (e.g., GPT-4) are typically available only as black boxes with inaccessible internals, are not retrainable by end users, and are frequently used for multimodal tasks. We study the possibility of selective prediction for vision-language models in a realistic black-box setting. We propose using the principle of neighborhood consistency to identify unreliable responses from a black-box vision-language model in question-answering tasks. We hypothesize that, given only a visual question and model response, the consistency of the model's responses over the neighborhood of the visual question will indicate reliability. In a black-box setting, it is impossible to directly sample neighbors in feature space. Instead, we show that it is possible to use a smaller proxy model to approximately sample from the neighborhood. We find that neighborhood consistency can be used to identify model responses to visual questions that are likely unreliable, even in adversarial settings or settings that are out of distribution for the proxy model.
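The abstract's recipe, sampling neighbors of the visual question with a proxy model, querying the black-box model on each neighbor, and abstaining when the answers disagree, can be sketched roughly as below. The callables `query_vlm` and `sample_neighbors`, the exact-match agreement measure, and the 0.6 threshold are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of neighborhood consistency for selective VQA.
# Assumptions (not from the paper): `query_vlm` wraps a black-box
# vision-language model, `sample_neighbors` uses a smaller proxy model to
# produce nearby rephrasings of the question, and agreement is exact match.
from typing import Callable, List, Optional


def selective_answer(
    image,                                               # image input accepted by the VLM
    question: str,
    query_vlm: Callable[[object, str], str],             # black-box VLM: (image, question) -> answer
    sample_neighbors: Callable[[str, int], List[str]],   # proxy model: question -> nearby questions
    n_neighbors: int = 5,
    threshold: float = 0.6,                              # illustrative abstention threshold
) -> Optional[str]:
    """Answer only if the VLM's response is consistent over the question's neighborhood."""
    original_answer = query_vlm(image, question)

    # Query the black-box model on approximate neighbors of the visual question.
    neighbor_answers = [query_vlm(image, q) for q in sample_neighbors(question, n_neighbors)]

    # Consistency = fraction of neighbor answers that agree with the original answer.
    agreement = sum(a == original_answer for a in neighbor_answers) / max(len(neighbor_answers), 1)

    if agreement >= threshold:
        return original_answer   # consistent neighborhood: treat the response as reliable
    return None                  # inconsistent neighborhood: abstain
```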
Related Material
[pdf] [supp] [arXiv] [bibtex]
@InProceedings{Khan_2024_CVPR,
  author    = {Khan, Zaid and Fu, Yun},
  title     = {Consistency and Uncertainty: Identifying Unreliable Responses From Black-Box Vision-Language Models for Selective Visual Question Answering},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2024},
  pages     = {10854-10863}
}