Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation

Xu, Zunnan; Chen, Zhihong; Zhang, Yong; Song, Yibing; Wan, Xiang; Li, Guanbin

Zunnan Xu, Zhihong Chen, Yong Zhang, Yibing Song, Xiang Wan, Guanbin Li; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 17503-17512

Abstract

Parameter-efficient tuning (PET) has received considerable attention because it reduces the number of parameters that must be updated while maintaining competitive performance and saving hardware resources. Despite substantial progress, most existing studies focus on either single-modal tasks or simple classification tasks; few address dense prediction tasks or the interaction between modalities. In this paper, we therefore conduct an in-depth investigation of efficient tuning for referring image segmentation. First, because the two encoders of a dual-encoder model do not interact, we design a novel adapter named Bridger to facilitate the exchange of cross-modal information. This module also injects vision-specific inductive biases and task-specific information into the pre-trained model while keeping its original parameters fixed. Second, we design a lightweight decoder for referring image segmentation that further aligns visual and linguistic features. To provide a comprehensive assessment and promote further research, we evaluate the proposed framework on several challenging benchmarks. Experimental results demonstrate the effectiveness of our approach: updating only 1.61% to 3.38% of the parameters, the proposed framework achieves performance comparable or even superior to existing full fine-tuning methods that use the same backbone.
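
To make the "fraction of parameters updated" figure concrete, the following is a minimal sketch of how such a ratio can be computed when the pre-trained encoders stay frozen and only added modules are trained. The module names and parameter counts are hypothetical placeholders, not values from the paper, and the sketch assumes the ratio is taken over all parameters in the model.

```python
from dataclasses import dataclass


@dataclass
class Module:
    """A named group of parameters that is either frozen or trainable."""
    name: str
    num_params: int
    trainable: bool


def trainable_fraction(modules):
    """Return the share of parameters that receive gradient updates."""
    total = sum(m.num_params for m in modules)
    updated = sum(m.num_params for m in modules if m.trainable)
    return updated / total


# Hypothetical sizes chosen only for illustration (assumption, not from the paper).
modules = [
    Module("vision_encoder", 80_000_000, trainable=False),      # frozen backbone
    Module("text_encoder", 60_000_000, trainable=False),        # frozen backbone
    Module("cross_modal_adapter", 2_000_000, trainable=True),   # added adapter
    Module("lightweight_decoder", 1_500_000, trainable=True),   # added decoder
]

fraction = trainable_fraction(modules)
print(f"trainable parameters: {fraction:.2%}")
```

With these placeholder sizes, only the adapter and the decoder contribute to the numerator, so the ratio stays small even though the full model is used at inference time.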

Related Material


@InProceedings{Xu_2023_ICCV,
    author    = {Xu, Zunnan and Chen, Zhihong and Zhang, Yong and Song, Yibing and Wan, Xiang and Li, Guanbin},
    title     = {Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {17503-17512}
}