Alignment and Generation Adapter for Efficient Video-Text Understanding

Han Fang, Zhifei Yang, Yuhan Wei, Xianghao Zang, Chao Ban, Zerun Feng, Zhongjiang He, Yongxiang Li, Hao Sun; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2023, pp. 2791-2797


Pre-trained models have demonstrated strong performance, especially in cross-modal understanding between videos and text. However, fine-tuning them at scale is costly and complicates adaptation to diverse downstream tasks. To address these challenges, we propose the Alignment-Generation Adapter (AGAdapter), which establishes semantic coherence between alignment and generation models for efficient video-text adaptation across multiple tasks simultaneously. We propose an alignment adapter with knowledge sharing to adapt the frozen CLIP model for fine-grained video-language interaction. We also introduce a generation adapter with prompt tuning that leverages a large language model for captioning. Furthermore, we introduce instruction joint tuning, combining textual and cross-modal instructions, to capture detailed descriptions. AGAdapter achieves state-of-the-art performance on video-text retrieval and video captioning across two benchmarks, MSR-VTT and ActivityNet.
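The abstract does not specify the adapter architecture. As a rough illustration only, adapters inserted into frozen backbones (such as CLIP) commonly follow a bottleneck pattern: a down-projection, a nonlinearity, an up-projection, and a residual connection. The sketch below shows this generic pattern; the dimensions, initialization, and class name are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

class BottleneckAdapter:
    """Generic bottleneck adapter: down-project, nonlinearity, up-project,
    plus a residual connection. This is the common pattern for adapting a
    frozen backbone; dimensions here are illustrative, not from the paper."""

    def __init__(self, d_model=512, d_bottleneck=64, seed=0):
        rng = np.random.default_rng(seed)
        # Small-scale init so the adapter starts near the identity mapping,
        # leaving the frozen backbone's features mostly unchanged at first.
        self.W_down = rng.normal(0.0, 0.02, (d_model, d_bottleneck))
        self.W_up = rng.normal(0.0, 0.02, (d_bottleneck, d_model))

    def __call__(self, x):
        # x: (seq_len, d_model) features from a frozen transformer layer.
        h = np.maximum(x @ self.W_down, 0.0)  # down-projection + ReLU
        return x + h @ self.W_up              # up-projection + residual

adapter = BottleneckAdapter()
features = np.ones((4, 512))   # e.g. 4 frame/token features
out = adapter(features)
print(out.shape)  # (4, 512): shape preserved, so the adapter slots
                  # between frozen layers without changing interfaces
```

During fine-tuning, only the adapter weights (`W_down`, `W_up`) would be trained while the backbone stays frozen, which is what makes this family of methods parameter-efficient.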

Related Material

@InProceedings{Fang_2023_ICCV,
  author    = {Fang, Han and Yang, Zhifei and Wei, Yuhan and Zang, Xianghao and Ban, Chao and Feng, Zerun and He, Zhongjiang and Li, Yongxiang and Sun, Hao},
  title     = {Alignment and Generation Adapter for Efficient Video-Text Understanding},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  month     = {October},
  year      = {2023},
  pages     = {2791-2797}
}