@InProceedings{Sharifdeen_2026_CVPR,
    author    = {Sharifdeen, Ashshak and Shamshad, Fahad and Munir, Muhammad Akhtar and Basu, Abhishek and Ismithdeen, Mohamed and Jeyamohan, Jeyapriyan and Silva, Chathurika and Nandakumar, Karthik and Khan, Muhammad Haris},
    title     = {Towards Calibrating Prompt Tuning of Vision-Language Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {39131-39140}
}
Towards Calibrating Prompt Tuning of Vision-Language Models
Abstract
Prompt tuning of large-scale vision-language models such as CLIP enables efficient task adaptation without updating model weights. However, it often leads to poor confidence calibration and unreliable predictive uncertainty. We address this problem by proposing a calibration framework that enhances predictive reliability while preserving the geometry of the pretrained CLIP embedding space, which is required for robust generalization. Our approach extends the standard cross-entropy loss with two complementary regularizers: (1) a mean-variance margin penalty that stabilizes inter-class logit margins by maximizing their average while minimizing dispersion, mitigating underconfidence and overconfidence spikes; and (2) a text moment-matching loss that aligns the first and second moments of tuned text embeddings with their frozen CLIP counterparts, preserving semantic dispersion crucial for generalization. Through extensive experiments across 7 prompt-tuning methods and 11 diverse datasets, we demonstrate that our approach significantly reduces the Expected Calibration Error (ECE) compared to competitive calibration techniques on both base and novel classes.
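To make the quantities named in the abstract concrete, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not give exact formulas, so several choices here are assumptions: the margin is taken as the true-class logit minus the largest other logit, the weight `lam` is hypothetical, the "first and second moments" are taken as the mean and covariance over class text embeddings, and ECE uses the common equal-width binning of top-class confidence. All function names are illustrative.

```python
import numpy as np

def margin_mean_variance_penalty(logits, labels, lam=1.0):
    """Assumed form: reward a large average margin, penalize its spread."""
    logits = np.array(logits, dtype=float)      # float copy, safe to mask
    labels = np.asarray(labels)
    rows = np.arange(logits.shape[0])
    true_logit = logits[rows, labels]
    logits[rows, labels] = -np.inf              # hide the true class
    runner_up = logits.max(axis=1)              # largest competing logit
    margins = true_logit - runner_up
    return -margins.mean() + lam * margins.var()

def text_moment_matching_loss(tuned, frozen):
    """Assumed form: squared gap between means and covariances of the
    (num_classes, dim) tuned and frozen text embedding matrices."""
    tuned = np.asarray(tuned, dtype=float)
    frozen = np.asarray(frozen, dtype=float)
    mean_gap = np.sum((tuned.mean(axis=0) - frozen.mean(axis=0)) ** 2)
    cov_gap = np.sum((np.cov(tuned, rowvar=False)
                      - np.cov(frozen, rowvar=False)) ** 2)
    return mean_gap + cov_gap

def expected_calibration_error(probs, labels, n_bins=10):
    """Standard binned ECE: weighted |accuracy - confidence| per bin."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels)
    conf = probs.max(axis=1)
    correct = (probs.argmax(axis=1) == labels).astype(float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (conf > lo) & (conf <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - conf[in_bin].mean())
            ece += in_bin.mean() * gap          # weight by bin population
    return ece
```

Under these assumptions, a training objective in the spirit of the abstract would add weighted versions of the two penalties to the cross-entropy loss, with ECE used only for evaluation.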

