T2V2T: Text-to-Video-to-Text Fusion for Text-to-Video Retrieval
Video-language transformers for text-to-video retrieval typically consist of a video encoder, a text encoder, and a joint encoder. Joint encoders fall into two categories: 1) self-attention-based fusion and 2) unidirectional fusion based on cross-attention. The former performs self-attention over the concatenation of video and text embeddings; although this allows full interaction between text and video, the length of the input sequence makes it computationally expensive. Unidirectional fusion instead uses more efficient cross-attention to fuse video embeddings into text embeddings, but it ignores text-to-video interaction. Text-to-video fusion remains underexplored because of the information imbalance between text and video, which makes it difficult to determine which video patches should serve as queries in cross-attention: a text embedding corresponds to one or more patch embeddings, whereas a video patch embedding may not correspond to any text embedding. To address this challenge, we devise Bypass cross-attention (Bypass CA), which prevents matching between irrelevant video-text embedding pairs in cross-attention. Building on Bypass CA, we propose a novel bidirectional interaction approach, Text-to-Video-to-Text (T2V2T) fusion, which chains two unidirectional fusions in opposite directions: text-to-video fusion followed by video-to-text fusion. The proposed T2V2T fusion yields state-of-the-art results on MSR-VTT, DiDeMo, and ActivityNet Captions.
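The bidirectional fusion described above can be sketched as two chained cross-attentions. This is a minimal NumPy illustration, not the paper's implementation: the relevance mask used to stand in for Bypass CA (here, a simple similarity threshold `sim_threshold`) and the residual-style fusion are assumptions for clarity, as the abstract does not specify these details.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, kv, mask=None):
    """Single-head cross-attention: queries q attend over keys/values kv."""
    d = q.shape[-1]
    scores = q @ kv.T / np.sqrt(d)            # (n_q, n_kv)
    if mask is not None:
        # Block irrelevant query-key pairs (stand-in for Bypass CA).
        scores = np.where(mask, scores, -1e9)
    return softmax(scores, axis=-1) @ kv      # (n_q, d)

def t2v2t_fusion(video, text, sim_threshold=0.0):
    """Illustrative T2V2T: text-to-video fusion, then video-to-text fusion.

    The masking rule below is an assumption: keep only video-text pairs
    whose dot-product similarity exceeds a threshold.
    """
    sim = video @ text.T                      # (n_video, n_text)
    mask = sim > sim_threshold
    # 1) Text-to-video: video patches act as queries over text embeddings,
    #    with irrelevant pairs masked out.
    video_fused = video + cross_attention(video, text, mask=mask)
    # 2) Video-to-text: text tokens act as queries over the fused video.
    text_fused = text + cross_attention(text, video_fused)
    return text_fused

rng = np.random.default_rng(0)
video = rng.normal(size=(8, 16))   # 8 video patch embeddings, dim 16
text = rng.normal(size=(4, 16))    # 4 text token embeddings, dim 16
out = t2v2t_fusion(video, text)
print(out.shape)                   # (4, 16)
```

The output keeps the text-side shape, matching the retrieval setting in which the final fused text representation is compared against video candidates.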