Robust Noisy Correspondence Learning with Equivariant Similarity Consistency

Yuchen Yang, Likai Wang, Erkun Yang, Cheng Deng; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 17700-17709

Abstract


The surge in multi-modal data has propelled cross-modal matching to the forefront of research interest. However, curating a large and accurately matched multi-modal dataset remains a laborious and expensive process. Commonly sourced from the Internet, these datasets often contain a significant proportion of mismatched data, which impairs the performance of matching models. To address this problem, we introduce a novel regularization approach named Equivariant Similarity Consistency (ESC), which facilitates robust separation of clean and noisy data and improves training for cross-modal matching. Intuitively, our method posits that for any two matched samples, the semantic variations caused by image changes should be proportional to those caused by text changes. Accordingly, we first compute the ESC by comparing image and text semantic variations between a set of elaborated anchor points and the remaining undivided training data. Pairs with high ESC are then filtered out as noisy correspondence pairs. We implement our method by combining the ESC with a traditional hinge-based triplet loss. Extensive experiments on three widely used datasets, namely Flickr30K, MS-COCO, and Conceptual Captions, verify the effectiveness of our method.
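
To make the ESC idea concrete, below is a minimal PyTorch sketch of how such a consistency score and the accompanying hinge-based triplet loss might look. The function names (esc_scores, hinge_triplet_loss), the anchor inputs, the threshold parameter, and the use of a mean absolute difference as the consistency measure are illustrative assumptions based on the abstract, not the paper's exact formulation.

```python
import torch

def esc_scores(img_emb, txt_emb, anchor_img, anchor_txt):
    """Hypothetical sketch of an Equivariant Similarity Consistency score.

    For a truly matched pair, the semantic variation measured on the image
    side (similarity to each anchor image) should track the variation
    measured on the text side (similarity to the same anchor's text).
    A large discrepancy flags a likely noisy correspondence.

    img_emb:    (N, D) L2-normalized image embeddings of training pairs
    txt_emb:    (N, D) L2-normalized text embeddings of training pairs
    anchor_img: (M, D) image embeddings of curated anchor pairs
    anchor_txt: (M, D) text embeddings of the same anchor pairs
    """
    # Semantic variation of each sample relative to every anchor,
    # measured separately in the image and text embedding spaces.
    img_var = img_emb @ anchor_img.t()   # (N, M) image-side similarities
    txt_var = txt_emb @ anchor_txt.t()   # (N, M) text-side similarities

    # ESC: discrepancy between the two variation patterns. The paper
    # posits proportionality; mean absolute difference is one simple
    # way such an inconsistency could be scored (an assumption here).
    return (img_var - txt_var).abs().mean(dim=1)   # (N,)

def split_clean_noisy(scores, threshold):
    """Pairs whose ESC exceeds the threshold are treated as noisy."""
    noisy = scores > threshold
    return ~noisy, noisy

def hinge_triplet_loss(sim, margin=0.2):
    """Standard hinge-based triplet loss over a similarity matrix
    sim (N, N) with matched pairs on the diagonal."""
    pos = sim.diag().view(-1, 1)
    cost_im = (margin + sim - pos).clamp(min=0)      # image -> hard texts
    cost_tx = (margin + sim - pos.t()).clamp(min=0)  # text  -> hard images
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    return cost_im.masked_fill(mask, 0).sum() + cost_tx.masked_fill(mask, 0).sum()
```

In this reading, the ESC scores drive the clean/noisy split, and training then applies the triplet loss on the pairs retained as clean; how the two terms are actually combined is detailed in the paper itself.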

Related Material


BibTeX:
@InProceedings{Yang_2024_CVPR,
    author    = {Yang, Yuchen and Wang, Likai and Yang, Erkun and Deng, Cheng},
    title     = {Robust Noisy Correspondence Learning with Equivariant Similarity Consistency},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {17700-17709}
}