MMA: Multi-Modal Adapter for Vision-Language Models

Lingxiao Yang, Ru-Yuan Zhang, Yanchen Wang, Xiaohua Xie; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 23826-23837

Abstract


Pre-trained Vision-Language Models (VLMs) have served as excellent foundation models for transfer learning in diverse downstream tasks. However, tuning VLMs for few-shot generalization tasks faces a discrimination--generalization dilemma, i.e., general knowledge should be preserved while task-specific knowledge is fine-tuned. How to precisely identify these two types of representations remains a challenge. In this paper, we propose a Multi-Modal Adapter (MMA) for VLMs to improve the alignment between representations from the text and vision branches. MMA aggregates features from different branches into a shared feature space so that gradients can be communicated across branches. To determine where to incorporate MMA, we systematically analyze the discriminability and generalizability of features across diverse datasets in both the vision and language branches, and find that (1) higher layers contain discriminable, dataset-specific knowledge, while lower layers contain more generalizable knowledge, and (2) language features are more discriminable than visual features, and there are large semantic gaps between the features of the two modalities, especially in the lower layers. Therefore, we incorporate MMA only into a few higher transformer layers to achieve an optimal balance between discrimination and generalization. We evaluate the effectiveness of our approach on three tasks: generalization to novel classes, generalization to novel target datasets, and domain generalization. Compared to many state-of-the-art methods, our MMA achieves leading performance in all evaluations. Code is available at https://github.com/ZjjConan/Multi-Modal-Adapter
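
The abstract describes the adapter only in prose, so below is a minimal PyTorch sketch of the idea under stated assumptions: the class name MultiModalAdapter, the feature widths, the bottleneck size, and the residual scale are all illustrative choices, not the authors' exact design (the reference implementation is in the linked repository). The point it illustrates is the shared feature space: both branches pass through common bottleneck weights, so gradients from either modality update the same parameters.

import torch
import torch.nn as nn

class MultiModalAdapter(nn.Module):
    """Illustrative MMA-style adapter (an assumed design, not the paper's exact one).

    Each branch is down-projected into one shared bottleneck, so gradients
    from the text and vision sides flow through common weights, then
    up-projected back to its own width and added residually.
    """

    def __init__(self, text_dim=512, vision_dim=768, shared_dim=64, scale=0.1):
        super().__init__()
        # Branch-specific down-projections into the shared space.
        self.down_text = nn.Linear(text_dim, shared_dim)
        self.down_vision = nn.Linear(vision_dim, shared_dim)
        # Shared bottleneck: the cross-branch gradient coupling lives here.
        self.shared = nn.Sequential(nn.Linear(shared_dim, shared_dim), nn.GELU())
        # Branch-specific up-projections back to each branch's width.
        self.up_text = nn.Linear(shared_dim, text_dim)
        self.up_vision = nn.Linear(shared_dim, vision_dim)
        self.scale = scale  # residual scaling; an assumed hyper-parameter

    def forward(self, text_feat, vision_feat):
        # Residual adapter updates on top of the frozen backbone features.
        t = text_feat + self.scale * self.up_text(self.shared(self.down_text(text_feat)))
        v = vision_feat + self.scale * self.up_vision(self.shared(self.down_vision(vision_feat)))
        return t, v

In line with the layer analysis above, such adapters would be attached only to the last few transformer layers of both encoders, leaving the lower, more generalizable layers frozen.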

Related Material


[bibtex]
@InProceedings{Yang_2024_CVPR,
    author    = {Yang, Lingxiao and Zhang, Ru-Yuan and Wang, Yanchen and Xie, Xiaohua},
    title     = {MMA: Multi-Modal Adapter for Vision-Language Models},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {23826-23837}
}