Holistic Features are almost Sufficient for Text-to-Video Retrieval

Kaibin Tian, Ruixiang Zhao, Zijie Xin, Bangxiang Lan, Xirong Li; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 17138-17147

Abstract


For text-to-video retrieval (T2VR), which aims to retrieve unlabeled videos by ad-hoc textual queries, CLIP-based methods currently lead the way. Compared to CLIP4Clip, which is efficient and compact, state-of-the-art models tend to compute video-text similarity through fine-grained cross-modal feature interaction and matching, putting their scalability for large-scale T2VR applications into doubt. We propose TeachCLIP, enabling a CLIP4Clip-based student network to learn from more advanced yet computationally intensive models. In order to create a learning channel to convey fine-grained cross-modal knowledge from a heavy model to the student, we add to CLIP4Clip a simple Attentional frame-Feature Aggregation (AFA) block, which by design adds no extra storage/computation overhead at the retrieval stage. Frame-text relevance scores calculated by the teacher network are used as soft labels to supervise the attentive weights produced by AFA. Extensive experiments on multiple public datasets justify the viability of the proposed method. TeachCLIP has the same efficiency and compactness as CLIP4Clip, yet has near-SOTA effectiveness.
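To make the mechanism described above concrete, here is a minimal PyTorch-style sketch of the idea, not the authors' implementation: an attention block aggregates per-frame features into one holistic video embedding, and a distillation loss pushes the attention weights toward the teacher's frame-text relevance scores used as soft labels. Names such as AFABlock, afa_distillation_loss, and teacher_frame_scores are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AFABlock(nn.Module):
    """Attentional frame-Feature Aggregation (sketch): per-frame
    features -> one holistic video embedding via learned attention."""
    def __init__(self, dim: int):
        super().__init__()
        self.scorer = nn.Linear(dim, 1)  # per-frame attention logit

    def forward(self, frame_feats: torch.Tensor):
        # frame_feats: (batch, num_frames, dim)
        logits = self.scorer(frame_feats).squeeze(-1)   # (B, F)
        weights = logits.softmax(dim=-1)                # attentive weights
        video_feat = (weights.unsqueeze(-1) * frame_feats).sum(dim=1)  # (B, D)
        return video_feat, weights

def afa_distillation_loss(student_weights, teacher_frame_scores, tau=1.0):
    """KL divergence pulling the student's attentive weights toward the
    teacher's frame-text relevance distribution (soft labels)."""
    soft_labels = (teacher_frame_scores / tau).softmax(dim=-1)
    return F.kl_div(student_weights.clamp_min(1e-8).log(),
                    soft_labels, reduction="batchmean")

# Dummy usage: 2 videos, 12 frames, 512-d CLIP features.
afa = AFABlock(dim=512)
frames = torch.randn(2, 12, 512)
teacher_scores = torch.randn(2, 12)       # stand-in teacher relevance scores
video_emb, w = afa(frames)
loss = afa_distillation_loss(w, teacher_scores)
```

Note the design point this illustrates: the teacher is only consulted during training; at retrieval time each video is represented by the single aggregated embedding, so the student keeps CLIP4Clip's storage and compute profile.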

Related Material


[pdf]
[bibtex]
@InProceedings{Tian_2024_CVPR,
    author    = {Tian, Kaibin and Zhao, Ruixiang and Xin, Zijie and Lan, Bangxiang and Li, Xirong},
    title     = {Holistic Features are almost Sufficient for Text-to-Video Retrieval},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {17138-17147}
}