VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control

Zi-Yuan Hu, Yanyang Li, Michael R. Lyu, Liwei Wang; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 3010-3020

Abstract


As the model size of pre-trained language models (PLMs) grows rapidly, full fine-tuning becomes prohibitively expensive for model training and storage. In vision-and-language (VL), parameter-efficient tuning (PET) techniques are proposed to integrate modular modifications (e.g., Adapter) into encoder-decoder PLMs. By tuning a small set of trainable parameters, these techniques perform on par with full fine-tuning. However, excessive modular modifications and neglecting the unique abilities of the encoders and decoders can lead to performance degradation, while existing PET techniques (e.g., VL-Adapter) overlook these issues. In this paper, we propose a Vision-and-Language Parameter-Efficient Tuning (VL-PET) framework to impose effective control over modular modifications via a novel granularity-controlled mechanism. Considering different granularity-controlled matrices generated by this mechanism, a variety of model-agnostic VL-PET modules can be instantiated from our framework for better efficiency and effectiveness trade-offs. We further propose lightweight designs to enhance VL alignment and modeling for the encoders and maintain text generation for the decoders. Extensive experiments conducted on four image-text tasks and four video-text tasks demonstrate the efficiency, effectiveness, scalability and transferability of our VL-PET framework. In particular, our VL-PET-large significantly outperforms full fine-tuning by 2.39% (2.61%) and VL-Adapter by 2.92% (3.41%) with BART-base (T5-base) on image-text tasks, while utilizing fewer trainable parameters. Furthermore, we validate the enhanced effect of employing our VL-PET designs (e.g., granularity-controlled mechanism and lightweight designs) on existing PET techniques, enabling them to achieve significant performance improvements.

Related Material


@InProceedings{Hu_2023_ICCV,
    author    = {Hu, Zi-Yuan and Li, Yanyang and Lyu, Michael R. and Wang, Liwei},
    title     = {VL-PET: Vision-and-Language Parameter-Efficient Tuning via Granularity Control},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {3010-3020}
}