Enhanced Motion-Text Alignment for Image-to-Video Transfer Learning

Zhang, Wei; Wan, Chaoqun; Liu, Tongliang; Tian, Xinmei; Shen, Xu; Ye, Jieping

Wei Zhang, Chaoqun Wan, Tongliang Liu, Xinmei Tian, Xu Shen, Jieping Ye; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 18504-18515

Abstract

Extending large image-text pre-trained models (e.g., CLIP) to video understanding has made significant advances. To enable CLIP to perceive dynamic information in videos, existing works equip the visual encoder with various temporal modules. However, these methods exhibit an "asymmetry" between the visual and textual sides, with neither temporal descriptions in the input texts nor temporal modules in the text encoder. This limitation hinders the potential of language supervision emphasized in CLIP and restricts the learning of temporal features, as the text encoder has demonstrated limited proficiency in motion understanding. To address this issue, we propose leveraging "MoTion-Enhanced Descriptions" (MoTED) to facilitate the extraction of distinctive temporal features in videos. Specifically, we first generate discriminative motion-related descriptions by querying GPT-4 to compare easily confused action categories. Then, we augment both the visual and textual encoders with additional perception modules to process the video frames and the generated descriptions, respectively. Finally, we adopt a contrastive loss to align the visual and textual motion features. Extensive experiments on five benchmarks show that MoTED surpasses state-of-the-art methods by convincing margins, laying a solid foundation for empowering CLIP with strong temporal modeling.
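The abstract does not spell out the form of the contrastive loss. As a rough illustration only, the sketch below assumes a CLIP-style symmetric InfoNCE objective over paired video and text motion features. The function names, the feature layout (one vector per sample, with index i on both sides forming a matched pair), and the temperature of 0.07 are assumptions made for this example, not details taken from the paper.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(video_feats, text_feats, temperature=0.07):
    """Hypothetical CLIP-style symmetric InfoNCE loss (an assumption,
    not the paper's stated formulation).

    video_feats[i] and text_feats[i] are treated as a matched pair;
    every other combination in the batch is treated as a negative.
    """
    v = [_normalize(x) for x in video_feats]
    t = [_normalize(x) for x in text_feats]
    # Cosine similarity between every video and every text, scaled by temperature.
    logits = [
        [sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
        for vi in v
    ]
    # Transpose to obtain the text-to-video direction.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))
```

With correctly paired features the loss is close to zero, and it grows when the pairing is shuffled, which is the behavior an alignment objective of this kind is meant to provide.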

Related Material

@InProceedings{Zhang_2024_CVPR,
    author    = {Zhang, Wei and Wan, Chaoqun and Liu, Tongliang and Tian, Xinmei and Shen, Xu and Ye, Jieping},
    title     = {Enhanced Motion-Text Alignment for Image-to-Video Transfer Learning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {18504--18515}
}