- [pdf] [supp]
Relaxing Contrastiveness in Multimodal Representation Learning
Multimodal representation learning for images with paired raw texts can improve the usability and generality of the learned semantic concepts while significantly reducing annotation costs. In this paper, we explore the design space of loss functions in visual-linguistic pretraining frameworks and propose a novel Relaxed Contrastive (ReCo) objective, which acts as a drop-in replacement of the widely used InfoNCE loss. The key insight of ReCo is to allow a relaxed negative space by not penalizing unpaired multimodal samples (ie, negative pairs) that are already orthogonal or negatively correlated. Unlike the widely-used InfoNCE, which keeps repelling negative pairs as long as they are not anti-correlated, ReCo by design embraces more diversity and flexibility of the learned embeddings. We conduct extensive experiments using ReCo with state-of-the-art models by pretraining on the MIMIC-CXR dataset that consists of chest radiographs and free-text radiology reports, and evaluating on the CheXpert dataset for multimodal retrieval and disease classification. Our ReCo achieves an absolute improvement of 2.9% over the InfoNCE baseline on the CheXpert Retrieval dataset in average retrieval precision and reports better or comparable performance in the linear evaluation and finetuning for classification. We further show that ReCo outperforms InfoNCE on the Flickr30K dataset by 1.7% in retrieval Recall@1, demonstrating the generalizability of our approach to natural images.