From Within to Between: Knowledge Distillation for Cross Modality Retrieval

Vinh Tran, Niranjan Balasubramanian, Minh Hoai; Proceedings of the Asian Conference on Computer Vision (ACCV), 2022, pp. 3223-3240

Abstract


We propose a novel loss function for training text-to-video and video-to-text retrieval networks based on knowledge distillation. This loss function addresses an important drawback of the max-margin loss function often used in existing cross-modality retrieval methods, in which a fixed margin is used in training to separate matching video-and-caption pairs from non-matching pairs, treating all non-matching pairs the same and failing to account for their different degrees of non-matching. We address this drawback by introducing a novel loss for the non-matching pairs; this loss leverages the knowledge within one domain to train a better network for matching between two domains. The proposed loss does not require extra annotation. It is complementary to the existing max-margin loss, and it can be integrated into the training pipeline of any cross-modality retrieval method. Experimental results on four cross-modal retrieval datasets, namely MSRVTT, ActivityNet, DiDeMo, and MSVD, show the effectiveness of the proposed method.
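For context on the drawback described above, the following is a minimal sketch of the standard bidirectional max-margin (triplet) loss commonly used in cross-modal retrieval. It is a generic illustration, not the paper's implementation: the similarity matrix, margin value, and function name are assumptions. Note how a single fixed margin is applied to every non-matching pair, regardless of how semantically close that pair actually is.

```python
import numpy as np

def max_margin_loss(sim, margin=0.2):
    """Generic bidirectional max-margin retrieval loss (illustrative sketch).

    sim: (N, N) similarity matrix between N videos (rows) and N captions
    (columns); sim[i, i] is the score of the matching pair. Every
    non-matching pair is pushed below its matching pair's score by the
    SAME fixed margin, which is the limitation the paper addresses.
    """
    n = sim.shape[0]
    pos = np.diag(sim)  # scores of the matching (video, caption) pairs
    # Video-to-text direction: each row's negatives vs. its positive.
    cost_v2t = np.maximum(0.0, margin + sim - pos[:, None])
    # Text-to-video direction: each column's negatives vs. its positive.
    cost_t2v = np.maximum(0.0, margin + sim - pos[None, :])
    # Matching pairs (the diagonal) are not negatives; mask them out.
    mask = 1.0 - np.eye(n)
    return float(((cost_v2t + cost_t2v) * mask).sum() / n)
```

In this formulation, a hard negative caption that nearly paraphrases the ground truth incurs the same margin as a completely unrelated one; the paper's distillation-based loss instead uses within-domain similarity to grade the penalty on non-matching pairs.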

Related Material


[pdf] [supp] [code]
[bibtex]
@InProceedings{Tran_2022_ACCV,
  author    = {Tran, Vinh and Balasubramanian, Niranjan and Hoai, Minh},
  title     = {From Within to Between: Knowledge Distillation for Cross Modality Retrieval},
  booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
  month     = {December},
  year      = {2022},
  pages     = {3223-3240}
}