- [pdf] [supp] [code]
From Within to Between: Knowledge Distillation for Cross Modality Retrieval
We propose a novel loss function, based on knowledge distillation, for training text-to-video and video-to-text retrieval networks. This loss addresses an important drawback of the max-margin loss commonly used in existing cross-modality retrieval methods: a fixed margin separates matching video-caption pairs from non-matching pairs, treating all non-matching pairs the same and failing to account for their different degrees of non-matching. We address this drawback with a new loss for the non-matching pairs that leverages knowledge within one modality to train a better network for matching between two modalities. The proposed loss requires no extra annotation, is complementary to the existing max-margin loss, and can be integrated into the training pipeline of any cross-modality retrieval method. Experimental results on four cross-modal retrieval datasets, namely MSRVTT, ActivityNet, DiDeMo, and MSVD, show the effectiveness of the proposed method.
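As a rough illustration only (not the authors' implementation, which the abstract does not specify), the two ingredients described above can be sketched in NumPy: a standard max-margin (triplet) loss over cross-modal similarities, plus a distillation-style term in which within-modality similarities (e.g., text-to-text) act as soft teacher targets for the cross-modal similarity distribution, so that more-similar non-matching pairs are penalized less. The function names, the temperature, and the choice of KL divergence are assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Normalize rows to unit length so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def max_margin_loss(sim, margin=0.2):
    # sim: (N, N) cross-modal similarity matrix; diagonal entries are the
    # matching video-caption pairs, off-diagonal entries are non-matching.
    n = sim.shape[0]
    pos = np.diag(sim)
    # Hinge over columns (caption retrieval) and rows (video retrieval);
    # every non-matching pair is pushed below the match by the same margin.
    cost_c = np.maximum(0.0, margin + sim - pos[:, None])
    cost_v = np.maximum(0.0, margin + sim - pos[None, :])
    mask = 1.0 - np.eye(n)  # exclude the matching (diagonal) pairs
    return ((cost_c + cost_v) * mask).sum() / n

def distillation_loss(sim_cross, sim_within, tau=0.1):
    # Teacher: softmax over within-modality similarities (knowledge "within"
    # one domain). Student: softmax over cross-modal similarities ("between").
    # KL(teacher || student) softly grades non-matching pairs instead of
    # treating them all the same.
    def softmax(x):
        e = np.exp((x - x.max(axis=1, keepdims=True)) / tau)
        return e / e.sum(axis=1, keepdims=True)
    t, s = softmax(sim_within), softmax(sim_cross)
    return (t * (np.log(t) - np.log(s))).sum(axis=1).mean()

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    video = l2_normalize(rng.standard_normal((4, 8)))  # 4 video embeddings
    text = l2_normalize(rng.standard_normal((4, 8)))   # 4 caption embeddings
    total = max_margin_loss(video @ text.T) + distillation_loss(
        video @ text.T, text @ text.T
    )
    print(float(total))
```

In training, the two terms would simply be summed (possibly with a weighting hyperparameter) and back-propagated through both encoders; the distillation term needs no labels beyond the pairing already used by the max-margin loss, consistent with the abstract's claim that no extra annotation is required.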