- [pdf] [supp]
Emphasizing Complementary Samples for Non-Literal Cross-Modal Retrieval
Existing cross-modal retrieval methods assume a straightforward relationship where images and text contain portrayals or mentions of the same objects. In contrast, real-world image-text pairs (e.g. an image and its caption in a news article) often feature more complex relations. Importantly, not all image-text pairs have the same relationship: in some pairs, image and text may be more closely aligned, while others are more loosely aligned hence complementary. In order to ensure the model learns a semantically robust space which captures nuanced relationships, care must be taken that loosely-aligned image-text pairs have a strong enough impact on learning. In this paper, we propose a novel approach to prioritize loosely-aligned samples. Unlike prior sample weighting methods, ours relies on estimating to what extent semantic similarity is preserved in the separate channels (images/text) in the learned multimodal space. In particular, the image-text pair weights in the retrieval loss focus learning towards samples from diverse or discrepant neighborhoods: samples where images or text that were close in a semantic space, are distant in the cross-modal space (diversity), or where neighbor relations are asymmetric (discrepancy). Experiments on three challenging datasets exhibiting abstract image-text relations, as well as COCO, demonstrate significant performance gains compared to recent state-of-the-art models and sample weighting approaches.