SaCo Loss: Sample-wise Affinity Consistency for Vision-Language Pre-training

Wu, Sitong; Tan, Haoru; Tian, Zhuotao; Chen, Yukang; Qi, Xiaojuan; Jia, Jiaya

Sitong Wu, Haoru Tan, Zhuotao Tian, Yukang Chen, Xiaojuan Qi, Jiaya Jia; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 27358-27369

Abstract

Vision-language pre-training (VLP) aims to learn joint representations of vision and language modalities. The contrastive paradigm is currently dominant in this field. However, we observe a notable misalignment phenomenon: the affinity between samples differs markedly across modalities, which we call the "Affinity Inconsistency Problem". Our intuition is that, for a well-aligned model, two images that look similar to each other should have the same level of similarity as the corresponding texts that describe them. In this paper, we first investigate the cause of this inconsistency problem. We find that the central cause is that existing training objectives do not consider sample-wise affinity consistency across modalities. To address this problem, we propose a novel loss function, named Sample-wise affinity Consistency (SaCo) loss, which is designed to enhance such consistency by minimizing the distance between image embedding similarity and text embedding similarity for any two samples. Our SaCo loss can be easily incorporated into existing vision-language models as an additional loss because it is complementary to most training objectives. In addition, considering that pre-training from scratch is computationally expensive, we also provide a more efficient alternative: continuing pre-training from a converged model with our loss integrated. Experimentally, the model trained with our SaCo loss significantly outperforms the baseline on a variety of vision and language tasks.
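The idea of matching image-side and text-side affinities can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or the distance, so this example assumes cosine similarity and a mean squared difference over all pairs of distinct samples; the function names are hypothetical and this is not the authors' implementation.

```python
import math


def cosine_similarity_matrix(embeddings):
    """Return the pairwise cosine similarity matrix for a list of vectors."""
    normalized = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0.0:
            raise ValueError("zero-norm embedding has no cosine similarity")
        normalized.append([x / norm for x in vec])
    return [
        [sum(a * b for a, b in zip(u, v)) for v in normalized]
        for u in normalized
    ]


def affinity_consistency_loss(image_embeddings, text_embeddings):
    """Mean squared gap between image-image and text-text similarities.

    Assumption: cosine similarity and squared distance. The abstract only
    says the loss minimizes "the distance" between the two similarities.
    """
    if len(image_embeddings) != len(text_embeddings):
        raise ValueError("each image needs exactly one paired text")
    n = len(image_embeddings)
    if n < 2:
        raise ValueError("at least two samples are needed to form a pair")

    image_sim = cosine_similarity_matrix(image_embeddings)
    text_sim = cosine_similarity_matrix(text_embeddings)

    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:  # skip self-similarity, which is always 1
                total += (image_sim[i][j] - text_sim[i][j]) ** 2
    return total / (n * (n - 1))


# Two images that look identical, paired with two unrelated texts:
# the affinities disagree, so the loss is large.
inconsistent = affinity_consistency_loss(
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
)

# Image and text affinities agree (both pairs are orthogonal): zero loss.
consistent = affinity_consistency_loss(
    [[1.0, 0.0], [0.0, 1.0]],
    [[2.0, 0.0], [0.0, 3.0]],
)
```

In training, a term like this would be added to the existing contrastive objective, presumably with a weighting coefficient; that coefficient is an assumption here, since the abstract does not mention one.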

Related Material

[bibtex]

@InProceedings{Wu_2024_CVPR,
    author    = {Wu, Sitong and Tan, Haoru and Tian, Zhuotao and Chen, Yukang and Qi, Xiaojuan and Jia, Jiaya},
    title     = {SaCo Loss: Sample-wise Affinity Consistency for Vision-Language Pre-training},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {27358-27369}
}