BiEfficient: Bidirectionally Prompting Vision-Language Models for Parameter-Efficient Video Recognition
Abstract
Vision-language models (VLMs) pre-trained on large-scale image-text pairs have shown great success in various image tasks. However, how to efficiently transfer such powerful VLMs to the video domain remains an open problem. Given that fully finetuning VLMs for video tasks can be computationally expensive, recent studies have turned their focus to parameter-efficient finetuning (PEFT). The great potential of VLMs lies in leveraging the bidirectional semantic connections between the two modalities of vision and language. Nevertheless, most current PEFT methods adopt a vision-only framework and largely ignore these semantic connections between vision and language. In this paper, we propose a novel method called BiEfficient, which uses bidirectional prompting schemes to efficiently transfer a VLM to the video recognition task with a small number of tunable parameters: 1) Vision-to-Language: we propose two prompt mechanisms, Pre-Prompt and Post-Prompt, which act before and after the text encoder, respectively, to generate a discriminative video-level text representation for each input video. 2) Language-to-Vision: we propose Word-Guided Visual-Prompt, which enhances the temporal modeling of videos using textual knowledge in an almost parameter-free manner. Experiments on Kinetics-400, UCF-101, and HMDB-51 demonstrate that the proposed method achieves performance comparable to or even better than full-finetuning methods with far fewer tunable parameters across closed-set and zero-shot video recognition benchmarks.
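To make the bidirectional prompting idea concrete, below is a minimal, hypothetical PyTorch sketch of the three schemes named in the abstract, assuming a frozen CLIP-style backbone with a shared feature dimension d. All module names, shapes, the toy encoders, and the use of a standard cross-attention layer as the word-guided temporal module are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of BiEfficient-style bidirectional prompting (not the
# authors' code). Shapes, names, and toy encoders are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

d, n_frames, n_ctx = 512, 8, 4  # assumed feature dim, frames per clip, prompt length


class ToyImageEncoder(nn.Module):
    """Stand-in for a frozen CLIP image tower: (N, 3, H, W) -> (N, d)."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(3, d)

    def forward(self, x):
        return self.proj(x.mean(dim=(2, 3)))


class ToyTextEncoder(nn.Module):
    """Stand-in for a frozen CLIP text tower: (C, L, d) -> (C, d)."""
    def forward(self, tokens):
        return tokens.mean(dim=1)


class BiEfficientSketch(nn.Module):
    def __init__(self, text_encoder, image_encoder):
        super().__init__()
        self.text_encoder = text_encoder.eval()
        self.image_encoder = image_encoder.eval()
        for p in self.parameters():          # freeze the pre-trained towers only
            p.requires_grad_(False)
        # Vision-to-Language, Pre-Prompt: learnable context tokens prepended to
        # the class-name tokens before the text encoder.
        self.pre_prompt = nn.Parameter(torch.randn(n_ctx, d) * 0.02)
        # Vision-to-Language, Post-Prompt: refine the encoded text feature with
        # the video feature after the text encoder (small residual MLP).
        self.post_prompt = nn.Sequential(nn.Linear(2 * d, d), nn.ReLU(), nn.Linear(d, d))
        # Language-to-Vision, Word-Guided Visual-Prompt: word embeddings query the
        # frame features across time. A standard cross-attention layer is used here
        # purely as a readable stand-in for the paper's (almost parameter-free) module.
        self.word_guide = nn.MultiheadAttention(d, num_heads=8, batch_first=True)

    def forward(self, frames, class_tokens, word_embeds):
        # frames: (B, T, 3, H, W); class_tokens: (C, L, d); word_embeds: (C, W, d)
        B, T = frames.shape[:2]
        frame_feats = self.image_encoder(frames.flatten(0, 1)).view(B, T, d)
        # Language-to-Vision: text-guided temporal aggregation of frame features.
        guide = word_embeds.mean(dim=0, keepdim=True).expand(B, -1, -1)      # (B, W, d)
        video_feat = self.word_guide(guide, frame_feats, frame_feats)[0].mean(dim=1)  # (B, d)
        # Vision-to-Language: Pre-Prompt before encoding, Post-Prompt after it.
        C = class_tokens.shape[0]
        text_in = torch.cat([self.pre_prompt.expand(C, -1, -1), class_tokens], dim=1)
        text_feats = self.text_encoder(text_in)                               # (C, d)
        fused = torch.cat([text_feats.unsqueeze(0).expand(B, -1, -1),
                           video_feat.unsqueeze(1).expand(-1, C, -1)], dim=-1)
        text_feats = text_feats + self.post_prompt(fused)                     # (B, C, d)
        # Recognition logits: cosine similarity between video and per-class text features.
        return torch.einsum('bd,bcd->bc',
                            F.normalize(video_feat, dim=-1),
                            F.normalize(text_feats, dim=-1))


# Toy usage with random tensors (10 classes, 5 words per class name).
model = BiEfficientSketch(ToyTextEncoder(), ToyImageEncoder())
logits = model(torch.randn(2, n_frames, 3, 224, 224),
               torch.randn(10, 16, d),
               torch.randn(10, 5, d))
print(logits.shape)  # torch.Size([2, 10])
```

In this sketch only the Pre-Prompt tokens, the Post-Prompt MLP, and the word-guided attention would be trained, while the pre-trained encoders stay frozen, reflecting the parameter-efficient setup described in the abstract.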
Related Material
[pdf] [bibtex]
@InProceedings{He_2024_ACCV,
    author    = {He, Haichen and Liu, Weibin and Xing, Weiwei},
    title     = {BiEfficient: Bidirectionally Prompting Vision-Language Models for Parameter-Efficient Video Recognition},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2024},
    pages     = {108-125}
}