ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models

Uddeshya Upadhyay, Shyamgopal Karthik, Massimiliano Mancini, Zeynep Akata; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 1899-1910

Abstract


Large-scale vision-language models (VLMs) like CLIP successfully find correspondences between images and text. Through the standard deterministic mapping process, an image or a text sample is mapped to a single vector in the embedding space. This is problematic because multiple samples (images or texts) can abstract the same concept in the physical world, so deterministic embeddings do not reflect the inherent ambiguity in the embedding space. We propose ProbVLM, a probabilistic adapter that estimates probability distributions for the embeddings of pre-trained VLMs via inter/intra-modal alignment in a post-hoc manner, without needing large-scale datasets or compute. On four challenging datasets, i.e., COCO, Flickr, CUB, and Oxford-flowers, we estimate the multi-modal embedding uncertainties for two VLMs, i.e., CLIP and BLIP, quantify the calibration of embedding uncertainties in retrieval tasks, and show that ProbVLM outperforms other methods. Furthermore, we propose active learning and model selection as two real-world downstream tasks for VLMs and show that the estimated uncertainty aids both tasks. Lastly, we present a novel technique for visualizing the embedding distributions using a large-scale pre-trained latent diffusion model.
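The idea of a post-hoc probabilistic adapter can be illustrated with a minimal sketch. The abstract does not specify the distribution family, network architecture, or loss, so everything below is an assumption chosen for illustration: a diagonal Gaussian output, a one-hidden-layer adapter with a residual mean, and a negative log-likelihood loss with an intra-modal term (the distribution should explain the sample's own frozen embedding) and a cross-modal term (it should also explain the paired embedding from the other modality). The names `GaussianAdapter`, `gaussian_nll`, `alignment_loss`, and `cross_weight` are hypothetical and are not taken from the paper or its code.

```python
import math
import random


def make_linear(rng, n_in, n_out):
    """Random weights and zero bias for one linear layer (illustrative init)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(layer, x):
    """Apply a linear layer to a vector given as a list of floats."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


class GaussianAdapter:
    """Maps a frozen point embedding z to a diagonal Gaussian (mu, log_var).

    Assumption: a Gaussian is used only for simplicity; the abstract does
    not state which distribution family ProbVLM estimates.
    """

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        self.hidden = make_linear(rng, dim, hidden)
        self.mu_head = make_linear(rng, hidden, dim)
        self.log_var_head = make_linear(rng, hidden, dim)

    def __call__(self, z):
        h = [max(0.0, v) for v in linear(self.hidden, z)]  # ReLU
        # Residual mean: the predicted mean is the frozen embedding plus an offset.
        mu = [zi + di for zi, di in zip(z, linear(self.mu_head, h))]
        log_var = linear(self.log_var_head, h)
        return mu, log_var


def gaussian_nll(target, mu, log_var):
    """Per-dimension mean negative log-likelihood under N(mu, diag(exp(log_var)))."""
    total = 0.0
    for t, m, lv in zip(target, mu, log_var):
        total += 0.5 * (lv + (t - m) ** 2 / math.exp(lv) + math.log(2.0 * math.pi))
    return total / len(target)


def alignment_loss(adapter, z_own, z_paired, cross_weight=1.0):
    """Intra-modal NLL on z_own plus weighted cross-modal NLL on z_paired."""
    mu, log_var = adapter(z_own)
    intra = gaussian_nll(z_own, mu, log_var)
    cross = gaussian_nll(z_paired, mu, log_var)
    return intra + cross_weight * cross


# Stand-ins for frozen image and text embeddings of one matching pair.
image_embedding = [0.1, -0.2, 0.3, 0.4]
text_embedding = [0.1, -0.1, 0.2, 0.5]
image_adapter = GaussianAdapter(dim=4, hidden=8, seed=0)
loss = alignment_loss(image_adapter, image_embedding, text_embedding)
```

In this sketch the frozen encoder is never touched: only the adapter's parameters would be trained, and the predicted `log_var` serves as the per-dimension uncertainty estimate for the embedding.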

Related Material


[bibtex]
@InProceedings{Upadhyay_2023_ICCV,
    author    = {Upadhyay, Uddeshya and Karthik, Shyamgopal and Mancini, Massimiliano and Akata, Zeynep},
    title     = {ProbVLM: Probabilistic Adapter for Frozen Vision-Language Models},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {1899-1910}
}