Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models

Li, Juncheng; Gao, Minghe; Wei, Longhui; Tang, Siliang; Zhang, Wenqiao; Li, Mengze; Ji, Wei; Tian, Qi; Chua, Tat-Seng; Zhuang, Yueting

Juncheng Li, Minghe Gao, Longhui Wei, Siliang Tang, Wenqiao Zhang, Mengze Li, Wei Ji, Qi Tian, Tat-Seng Chua, Yueting Zhuang; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 2551-2562

Abstract

Prompt tuning, a recently emerging paradigm, enables the powerful vision-language pre-training models to adapt to downstream tasks in a parameter- and data- efficient way, by learning the "soft prompts" to condition frozen pre-training models. Though effective, it is particularly problematic in the few-shot scenario, where prompt tuning performance is sensitive to the initialization and requires a time-consuming process to find a good initialization, thus restricting the fast adaptation ability of the pre-training models. In addition, prompt tuning could undermine the generalizability of the pre-training models, because the learnable prompt tokens are easy to overfit to the limited training samples. To address these issues, we introduce a novel Gradient-RegulAted Meta-prompt learning (GRAM) framework that jointly meta-learns an efficient soft prompt initialization for better adaptation and a lightweight gradient regulating function for strong cross-domain generalizability in a meta-learning paradigm using only the unlabeled image-text pre-training data. Rather than designing a specific prompt tuning method, our GRAM can be easily incorporated into various prompt tuning methods in a model-agnostic way, and comprehensive experiments show that GRAM brings about consistent improvement for them in several settings (i.e., few-shot learning, cross-domain generalization, cross-dataset generalization, etc.) over 11 datasets. Further, experiments show that GRAM enables the orthogonal methods of textual and visual prompt tuning to work in a mutually-enhanced way, offering better generalizability beyond the uni-modal prompt tuning methods.

Related Material

[pdf] [arXiv]

[bibtex]

@InProceedings{Li_2023_ICCV, author = {Li, Juncheng and Gao, Minghe and Wei, Longhui and Tang, Siliang and Zhang, Wenqiao and Li, Mengze and Ji, Wei and Tian, Qi and Chua, Tat-Seng and Zhuang, Yueting}, title = {Gradient-Regulated Meta-Prompt Learning for Generalizable Vision-Language Models}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, month = {October}, year = {2023}, pages = {2551-2562} }