Learning Trajectory-Word Alignments for Video-Language Tasks

Xu Yang, Zhangzikang Li, Haiyang Xu, Hanwang Zhang, Qinghao Ye, Chenliang Li, Ming Yan, Yu Zhang, Fei Huang, Songfang Huang; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 2504-2514
Abstract
In a video, an object usually appears as a trajectory: it spans only a few spatial patches but many temporal ones, and thus carries abundant spatiotemporal context. However, modern Video-Language BERTs (VDL-BERTs) neglect this trajectory characteristic: following image-language BERTs (IL-BERTs), they deploy patch-to-word (P2W) attention, which may over-exploit trivial spatial contexts while neglecting significant temporal contexts. To amend this, we propose TW-BERT, which learns Trajectory-Word alignments through a newly designed trajectory-to-word (T2W) attention for solving video-language tasks. Moreover, previous VDL-BERTs usually feed the model a few uniformly sampled frames, while trajectories have diverse graininess, i.e., some span more frames and some fewer, so sampling only a few frames loses useful temporal contexts. Simply sampling more frames, however, makes pre-training infeasible due to the greatly increased training burden. To alleviate this problem, during the fine-tuning stage we insert a novel Hierarchical Frame-Selector (HFS) module into the video encoder. HFS gradually selects suitable frames conditioned on the text so that the subsequent cross-modal encoder can learn better trajectory-word alignments. With the proposed T2W attention and HFS, our TW-BERT achieves state-of-the-art performance on text-to-video retrieval tasks and performs comparably on video question-answering tasks to some VDL-BERTs trained on much more data. The code will be available in the supplementary material.
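As a rough illustration of the two components above, the following PyTorch-style sketch shows (i) a single-head trajectory-to-word (T2W) cross-attention layer, where trajectory features act as queries over word features, and (ii) one text-conditioned frame-selection stage of the kind HFS stacks hierarchically. All class, method, and parameter names here are illustrative assumptions, not the authors' actual implementation, which is provided in the supplementary material.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TrajectoryWordAttention(nn.Module):
    """Sketch of single-head T2W cross-attention: trajectories query words."""
    def __init__(self, d_model: int = 768):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)  # queries from trajectory features
        self.k_proj = nn.Linear(d_model, d_model)  # keys from word features
        self.v_proj = nn.Linear(d_model, d_model)  # values from word features
        self.scale = d_model ** -0.5

    def forward(self, traj, words):
        # traj:  (B, T, d) one feature per object trajectory
        # words: (B, W, d) word features from the text encoder
        q, k, v = self.q_proj(traj), self.k_proj(words), self.v_proj(words)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (B, T, W)
        return attn @ v  # (B, T, d) word-aligned trajectory features

class FrameSelectorStage(nn.Module):
    """Sketch of one HFS-style stage: keep the top-k frames scored against the text."""
    def __init__(self, keep: int):
        super().__init__()
        self.keep = keep

    def forward(self, frames, text):
        # frames: (B, F, d) frame features; text: (B, d) pooled text feature
        scores = torch.einsum('bfd,bd->bf', frames, text)  # text-conditioned frame scores
        idx = scores.topk(self.keep, dim=1).indices.sort(dim=1).values  # preserve temporal order
        return frames.gather(1, idx.unsqueeze(-1).expand(-1, -1, frames.size(-1)))

Stacking several such selection stages with a shrinking keep budget inside the video encoder would approximate the gradual, text-conditioned frame selection the abstract describes.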
Related Material

[pdf] [supp] [arXiv] [bibtex]

@InProceedings{Yang_2023_ICCV,
  author    = {Yang, Xu and Li, Zhangzikang and Xu, Haiyang and Zhang, Hanwang and Ye, Qinghao and Li, Chenliang and Yan, Ming and Zhang, Yu and Huang, Fei and Huang, Songfang},
  title     = {Learning Trajectory-Word Alignments for Video-Language Tasks},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2023},
  pages     = {2504-2514}
}