Relaxing Binary Constraints in Contrastive Vision-Language Medical Representation Learning

Xiaoyang Wei, Camille Kurtz, Florence Cloppet; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 4462-4471

Abstract


By taking paired image and caption embeddings as input, contrastive vision-language representation learning has witnessed significant advances, as illustrated by CLIP, which allows visual encoders to learn from textual supervision and vice versa. Benefiting from millions of image-caption pairs collected from the Internet, CLIP-like models show competitive performance against fully supervised baselines. However, the learned visual representations are still undermined by a binary constraint: most contrastive learning frameworks assume a strict one-to-one correspondence between the input pairs of data and optimize the models using the InfoNCE loss function. The embeddings of paired image-text are aligned, while those of unpaired image-text are pushed away from each other. In fact, there are naturally many "false negatives" among these negative pairs, since unpaired data can also exhibit high similarity. In this work, we aim to overcome the impact of false negatives in vision-language representation learning by introducing soft targets that estimate the similarity between unpaired images and texts using external semantic knowledge structured in the form of graphs. The interest of such a method is demonstrated in the application context of medical imaging.
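To make the idea of relaxing the binary constraint concrete, below is a minimal NumPy sketch of a soft-target contrastive loss. It is not the authors' exact formulation: the mixing weight `alpha`, the placeholder `soft_sim` matrix (which would, in the paper's setting, be derived from external semantic knowledge graphs), and the helper names are assumptions for illustration. Standard InfoNCE corresponds to `alpha = 1` (one-hot targets); `alpha < 1` softens the penalty on plausible "false negative" pairs.

```python
import numpy as np

def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def softmax(x, axis=-1):
    return np.exp(log_softmax(x, axis=axis))

def soft_contrastive_loss(image_emb, text_emb, soft_sim, alpha=0.5, temperature=0.07):
    """Symmetric contrastive loss with soft targets.

    Instead of one-hot targets (strict one-to-one pairing), the targets mix
    the identity matrix with a normalized external similarity matrix
    `soft_sim`, so unpaired but semantically similar image-text pairs are
    not pushed apart as hard as true negatives.
    """
    # L2-normalize embeddings so dot products are cosine similarities
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature              # (N, N) similarity logits

    n = logits.shape[0]
    hard = np.eye(n)                                # strict one-to-one targets
    soft = softmax(soft_sim, axis=1)                # relaxed targets from external knowledge
    targets = alpha * hard + (1 - alpha) * soft     # rows still sum to 1

    # Cross-entropy with soft targets, in both retrieval directions
    loss_i2t = -(targets * log_softmax(logits, axis=1)).sum(axis=1).mean()
    loss_t2i = -(targets.T * log_softmax(logits.T, axis=1)).sum(axis=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

With `alpha = 1` the soft term vanishes and the loss reduces to the usual symmetric InfoNCE objective; lowering `alpha` distributes target mass over off-diagonal pairs in proportion to their external similarity.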

Related Material


[bibtex]
@InProceedings{Wei_2025_WACV,
    author    = {Wei, Xiaoyang and Kurtz, Camille and Cloppet, Florence},
    title     = {Relaxing Binary Constraints in Contrastive Vision-Language Medical Representation Learning},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {4462-4471}
}