SST-VLM: Sparse Sampling-Twice Inspired Video-Language Model
Most existing video-language modeling methods densely sample dozens (or even hundreds) of video clips from each raw video to learn the video representation for text-to-video retrieval. This paradigm incurs a high computational cost. Sparse sampling-based methods have therefore been proposed recently, which sample only a handful of short video clips from each raw video. However, they still struggle to learn a reliable video embedding from the fragmented clips of each raw video. To overcome this challenge, we present a novel video-language model called SST-VLM, inspired by a Sparse Sampling-Twice (SST) strategy, in which each raw video is represented by only two holistic video clips (each containing a few frames that span the entire video). To train our SST-VLM, we propose a new Dual Cross-modal MoCo (Dual X-MoCo) algorithm, which includes two cross-modal MoCo modules that respectively model the two clip-text pairs (for each video-text input). In addition to the classic cross-modal contrastive objective, we devise a clip-level alignment objective that yields more consistent retrieval performance by aligning the prediction distributions of the two video clips (based on the negative queues of MoCo). Extensive experiments show that our SST-VLM achieves new state-of-the-art results in text-to-video retrieval.
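
To make the two central ideas concrete, the following is a minimal sketch, not our implementation. It assumes that each holistic clip is formed by splitting the video into equal segments and drawing one random frame from each, and that the clip-level alignment objective is a symmetric KL divergence between the softmax distributions that the two clips induce over one positive text and the texts in a negative queue. The segment-based sampling, the choice of divergence, the temperature value, and all function names (`sample_clip`, `sample_twice`, `clip_alignment_loss`) are illustrative assumptions rather than details stated above.

```python
import math
import random


def sample_clip(num_frames, frames_per_clip, rng):
    """Draw one random frame index from each of `frames_per_clip` equal segments.

    Assumption: this segment-based scheme is one way to obtain a clip with
    few frames that still spans the entire video.
    """
    if num_frames < frames_per_clip:
        raise ValueError("video has fewer frames than requested per clip")
    # Segment boundaries; every segment is non-empty given the check above.
    bounds = [i * num_frames // frames_per_clip for i in range(frames_per_clip + 1)]
    return [rng.randrange(bounds[i], bounds[i + 1]) for i in range(frames_per_clip)]


def sample_twice(num_frames, frames_per_clip, rng):
    """Sparse sampling-twice: two independent holistic clips per raw video."""
    return (
        sample_clip(num_frames, frames_per_clip, rng),
        sample_clip(num_frames, frames_per_clip, rng),
    )


def softmax(scores, temperature):
    """Numerically stable softmax over temperature-scaled similarity scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def clip_alignment_loss(sims_a, sims_b, temperature=0.07):
    """Symmetric KL between the prediction distributions of the two clips.

    `sims_a` and `sims_b` hold each clip's similarity to the same list of
    texts (the positive text followed by the queued negatives). The symmetric
    KL form and the temperature of 0.07 are assumptions for illustration.
    """
    p = softmax(sims_a, temperature)
    q = softmax(sims_b, temperature)
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))
```

Under these assumptions, a call such as `sample_twice(300, 8, random.Random(0))` returns two lists of eight frame indices each, and `clip_alignment_loss` is zero when both clips score every text identically and grows as their prediction distributions diverge.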