MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval

Jin, Xiaojie; Zhang, Bowen; Gong, Weibo; Xu, Kai; Deng, Xueqing; Wang, Peng; Zhang, Zhao; Shen, Xiaohui; Feng, Jiashi

Xiaojie Jin, Bowen Zhang, Weibo Gong, Kai Xu, Xueqing Deng, Peng Wang, Zhao Zhang, Xiaohui Shen, Jiashi Feng; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 27144-27153

Abstract

State-of-the-art video-text retrieval (VTR) methods typically involve fully fine-tuning a pre-trained model (e.g., CLIP) on specific datasets. However, this can result in significant storage costs in practical applications, as a separate model must be stored per task. To address this issue, we present our pioneering work that enables parameter-efficient VTR using a pre-trained model with only a small number of tunable parameters during training. Towards this goal, we propose a new method dubbed Multimodal Video Adapter (MV-Adapter) for efficiently transferring the knowledge in the pre-trained CLIP from image-text to video-text. Specifically, MV-Adapter utilizes bottleneck structures in both video and text branches, along with two novel components. The first is a Temporal Adaptation Module that is incorporated in the video branch to introduce global and local temporal contexts. We also train weight calibrations to adjust to dynamic variations across frames. The second is Cross Modality Tying, which generates weights for the video/text branches by sharing cross-modality factors, for better alignment between modalities. Thanks to the above innovations, MV-Adapter can achieve comparable or better performance than standard fine-tuning with negligible parameter overhead. Notably, MV-Adapter consistently outperforms various competing methods in V2T/T2V tasks by large margins on five widely used VTR benchmarks (MSR-VTT, MSVD, LSMDC, DiDemo, and ActivityNet). Code will be released.
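The bottleneck structure mentioned in the abstract can be illustrated with a generic residual adapter. The following is a minimal sketch under stated assumptions, not the MV-Adapter implementation: the class name `BottleneckAdapter`, the feature width of 512, the bottleneck width of 64, the ReLU nonlinearity, and the zero initialization of the up-projection are all hypothetical choices made for illustration. The Temporal Adaptation Module and Cross Modality Tying are not modeled, since the abstract does not specify their details.

```python
import numpy as np


class BottleneckAdapter:
    """Generic residual bottleneck adapter (illustrative, not the paper's code)."""

    def __init__(self, dim, bottleneck, seed=0):
        rng = np.random.default_rng(seed)
        # Down-projection to a narrow hidden width (assumed small random init).
        self.w_down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        self.b_down = np.zeros(bottleneck)
        # Up-projection starts at zero, so the adapter is initially an identity map.
        self.w_up = np.zeros((bottleneck, dim))
        self.b_up = np.zeros(dim)

    def __call__(self, x):
        # x: (..., dim) features produced by a frozen backbone layer.
        hidden = np.maximum(x @ self.w_down + self.b_down, 0.0)  # ReLU
        return x + hidden @ self.w_up + self.b_up  # residual connection

    def num_params(self):
        # Only these parameters would be trained; the backbone stays frozen.
        params = (self.w_down, self.b_down, self.w_up, self.b_up)
        return sum(p.size for p in params)


# Hypothetical sizes: 12 frames, each with a 512-dimensional feature vector.
adapter = BottleneckAdapter(dim=512, bottleneck=64)
frames = np.random.default_rng(1).normal(size=(12, 512))
out = adapter(frames)
```

With these assumed sizes the adapter holds 2 x 512 x 64 weights plus 64 + 512 biases, which is small compared with a full 512 x 512 dense layer stack, and is the sense in which such modules keep the per-task storage cost low.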

Related Material

@InProceedings{Jin_2024_CVPR,
    author    = {Jin, Xiaojie and Zhang, Bowen and Gong, Weibo and Xu, Kai and Deng, Xueqing and Wang, Peng and Zhang, Zhao and Shen, Xiaohui and Feng, Jiashi},
    title     = {MV-Adapter: Multimodal Video Transfer Learning for Video Text Retrieval},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {27144-27153}
}