Object Prior Embedded Network for Query-Agnostic Image Retrieval
Text-to-image retrieval plays an important role in bridging the gap between the vision and language modalities. The task is challenging and far from solved because of the large visual-semantic discrepancy between language and vision. Recent studies on vision-language contrastive learning have shown that good representations can be learned effectively from massive image-text pairs. However, most existing methods simply concatenate image and text features as input and rely on a deep network to learn the visual-semantic relationship between image and text in a brute-force manner. The resulting lack of explicit alignment information poses a challenging weakly supervised learning problem and limits the accuracy of previous methods. Motivated by the observation that the salient objects in an image can be detected accurately and are often mentioned in the paired text, we propose a novel cross-attention transformer that uses the objects detected in an image as anchor points and priors, significantly easing the learning of image-text alignments and thus boosting text-to-image search accuracy. In addition, unlike the query-dependent architectures adopted by most previous methods, our method is query-agnostic and is therefore significantly faster at inference time. Extensive experiments on the Flickr30K and MSCOCO Captions datasets demonstrate that our method outperforms the state-of-the-art methods while preserving inference efficiency.
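To make the core idea concrete, the following is a minimal NumPy sketch of object-anchored cross-attention, not the paper's actual architecture: detected-object features act as the queries (anchors) attending over caption token features, so the attention weights directly express object-to-word alignments. The function name, dimensions, and random projection matrices (stand-ins for learned parameters) are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def object_anchored_cross_attention(obj_feats, text_feats, d_k=64, seed=0):
    """Toy single-head cross-attention in which detected-object features
    serve as queries (anchor points) over caption token features.
    The projection weights are random stand-ins for learned parameters."""
    rng = np.random.default_rng(seed)
    d_obj, d_txt = obj_feats.shape[1], text_feats.shape[1]
    W_q = rng.standard_normal((d_obj, d_k)) / np.sqrt(d_obj)
    W_k = rng.standard_normal((d_txt, d_k)) / np.sqrt(d_txt)
    W_v = rng.standard_normal((d_txt, d_k)) / np.sqrt(d_txt)
    Q = obj_feats @ W_q    # (num_objects, d_k): one query per detected object
    K = text_feats @ W_k   # (num_tokens, d_k)
    V = text_feats @ W_v   # (num_tokens, d_k)
    # Each row of `attn` is a distribution over caption tokens for one object,
    # i.e. a soft object-to-word alignment.
    attn = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)
    return attn @ V, attn

# Example: 3 detected objects attend over a 5-token caption.
objs = np.random.default_rng(1).standard_normal((3, 128))
toks = np.random.default_rng(2).standard_normal((5, 256))
out, attn = object_anchored_cross_attention(objs, toks)
```

Note the query-agnostic property claimed in the abstract would correspond, in a setup like this, to computing the image-side (object) representations once offline, so that at query time only the text needs encoding before ranking; nothing on the image side depends on the incoming query.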