Adaptive Cross-Modal Prototypes for Cross-Domain Visual-Language Retrieval
In this paper, we study the task of visual-text retrieval in the highly practical setting in which labelled visual data with paired text descriptions are available in one domain (the "source"), but only unlabelled visual data (without text descriptions) are available in the domain of interest (the "target"). We propose the ADAPTIVE CROSS-MODAL PROTOTYPES framework which seeks to enable target domain retrieval by learning cross-modal visual-text representations while minimising both uni-modal and cross-modal distribution shift across the source and target domains. Our approach is built upon two key ideas: first, we encode the inductive bias that the learned cross-modal representations should be compositional with respect to concepts in each modality--this is achieved through clustering pretrained uni-modal features across each domain and designing a careful regularisation scheme to preserve the resulting structure. Second, we employ mutual information maximisation between cross-modal representations in the source and target domains during learning--this provides a mechanism that preserves commonalities between the domains while discarding signal in each that cannot be inferred from the other. We showcase our approach for the task of cross-domain visual-text retrieval, outperforming existing approaches for both images and videos.