Misalign, Contrast then Distill: Rethinking Misalignments in Language-Image Pre-training

Bumsoo Kim, Yeonsik Jo, Jinhyung Kim, Seunghwan Kim; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 2563-2572

Abstract


Contrastive Language-Image Pre-training has emerged as a prominent approach for training vision and text encoders with uncurated image-text pairs from the web. To enhance data efficiency, recent efforts have introduced additional supervision terms that involve randomly augmented views of the image. However, since the image augmentation process is unaware of its text counterpart, this procedure can cause various degrees of image-text misalignment during training. Prior methods either disregarded this discrepancy or introduced external models to mitigate the impact of misalignments during training. In contrast, we propose a novel metric learning approach that capitalizes on these misalignments as an additional training source, which we term "Misalign, Contrast then Distill (MCD)". Unlike previous methods that treat augmented images and their text counterparts as simple positive pairs, MCD predicts the continuous scales of misalignment caused by the augmentation. Our extensive experimental results show that our proposed MCD achieves state-of-the-art transferability on multiple downstream classification and retrieval datasets.
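To make the core idea concrete, below is a minimal, hypothetical sketch of a CLIP-style contrastive loss that replaces the hard positive-pair target with a continuous alignment score for each augmented view, in the spirit of what the abstract describes. This is not the authors' released MCD implementation: the function names, the soft-target construction, and the idea of deriving the alignment score from the augmentation (e.g., crop overlap with the original image) are all illustrative assumptions.

import torch
import torch.nn.functional as F

def soft_contrastive_loss(image_emb, text_emb, align_scores, temperature=0.07):
    """Contrastive loss with continuous alignment targets (illustrative sketch).

    image_emb:    (N, D) embeddings of augmented image views
    text_emb:     (N, D) embeddings of the paired captions
    align_scores: (N,) values in [0, 1]; 1.0 = fully aligned pair, smaller
                  values down-weight pairs whose augmentation removed
                  caption-relevant content (hypothetical proxy, e.g. crop overlap)
    """
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature  # (N, N) similarity matrix

    # Soft targets: the matching pair on the diagonal receives the continuous
    # alignment score; the remaining probability mass is spread uniformly over
    # the non-matching pairs instead of being forced to zero.
    n = logits.size(0)
    targets = torch.zeros_like(logits)
    targets[torch.arange(n), torch.arange(n)] = align_scores
    off_mass = (1.0 - align_scores).unsqueeze(1) / (n - 1)
    targets = targets + off_mass * (1.0 - torch.eye(n, device=logits.device))

    # Symmetric image-to-text and text-to-image cross-entropy with soft targets.
    loss_i2t = -(targets * F.log_softmax(logits, dim=1)).sum(1).mean()
    loss_t2i = -(targets.t() * F.log_softmax(logits.t(), dim=1)).sum(1).mean()
    return 0.5 * (loss_i2t + loss_t2i)

With align_scores fixed to all ones, this reduces to the standard symmetric InfoNCE loss used in CLIP-style pre-training; the continuous scores are what distinguish the misalignment-aware variant sketched here.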

Related Material


[bibtex]
@InProceedings{Kim_2023_ICCV,
    author    = {Kim, Bumsoo and Jo, Yeonsik and Kim, Jinhyung and Kim, Seunghwan},
    title     = {Misalign, Contrast then Distill: Rethinking Misalignments in Language-Image Pre-training},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {2563-2572}
}