ACMM: Aligned Cross-Modal Memory for Few-Shot Image and Sentence Matching

Yan Huang, Liang Wang; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2019, pp. 5774-5783

Abstract


Image and sentence matching has drawn much attention recently, but due to the lack of sufficient pairwise training data, most previous methods still cannot reliably associate challenging pairs of images and sentences that contain rarely appearing regions and words, i.e., few-shot content. In this work, we study this challenging scenario as few-shot image and sentence matching, and accordingly propose an Aligned Cross-Modal Memory (ACMM) model to memorize the rarely appearing content. Given an image-sentence pair, the model first uses an aligned memory controller network to produce two sets of semantically comparable interface vectors through cross-modal alignment. The interface vectors are then used by modality-specific read and update operations to alternately interact with shared memory items. The memory items persistently memorize cross-modal shared semantic representations, which can be addressed to enhance the representation of few-shot content. We apply the proposed model to both conventional and few-shot image and sentence matching tasks, and demonstrate its effectiveness by achieving state-of-the-art performance on two benchmark datasets.
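
To make the memory mechanism concrete, below is a minimal PyTorch sketch of a shared memory with attention-based read and update operations, roughly in the spirit of the abstract. All names, shapes, and the update rule are illustrative assumptions; the paper's actual aligned controller, gating, and alignment details are not reproduced here.

import torch
import torch.nn.functional as F

class SharedCrossModalMemory(torch.nn.Module):
    # Hypothetical sketch of a shared memory addressed by interface
    # vectors from either modality; not the authors' exact formulation.
    def __init__(self, num_items: int, dim: int):
        super().__init__()
        # Memory items shared across the visual and textual modalities.
        self.memory = torch.nn.Parameter(torch.randn(num_items, dim) * 0.02)

    def read(self, interface: torch.Tensor) -> torch.Tensor:
        # interface: (batch, dim) interface vectors from one modality.
        # Address memory items via softmax similarity and read a weighted
        # sum that is added back to enhance the input representation.
        attn = F.softmax(interface @ self.memory.t(), dim=-1)  # (batch, items)
        return interface + attn @ self.memory                   # (batch, dim)

    def update(self, interface: torch.Tensor, rate: float = 0.1) -> None:
        # Blend each memory item toward the interface vectors that address
        # it, so rarely seen (few-shot) content is memorized over time.
        attn = F.softmax(interface @ self.memory.t(), dim=-1)       # (batch, items)
        weight = attn.sum(dim=0, keepdim=True).t().clamp(min=1e-6)  # (items, 1)
        target = (attn.t() @ interface) / weight                    # (items, dim)
        with torch.no_grad():
            self.memory.mul_(1.0 - rate).add_(rate * target)

In such a setup, the visual and textual interface vectors produced by the controller would each call read to enhance their representations and update to write back, alternating between the two modalities as described above.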

Related Material


[pdf]
[bibtex]
@InProceedings{Huang_2019_ICCV,
author = {Huang, Yan and Wang, Liang},
title = {ACMM: Aligned Cross-Modal Memory for Few-Shot Image and Sentence Matching},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
month = {October},
year = {2019}
}