VL-Match: Enhancing Vision-Language Pretraining with Token-Level and Instance-Level Matching

Junyu Bi, Daixuan Cheng, Ping Yao, Bochen Pang, Yuefeng Zhan, Chuanguang Yang, Yujing Wang, Hao Sun, Weiwei Deng, Qi Zhang; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 2584-2593

Abstract

Vision-Language Pretraining (VLP) has significantly improved performance on a range of vision-language tasks by learning to match images and texts. In this paper, we propose VL-Match, a Vision-Language Pretraining framework with Enhanced Token-level and Instance-level Matching. At the token level, a Vision-Language Replaced Token Detection task is designed to promote substantial interaction between text tokens and images: the text encoder of VLP serves as a generator that produces a corrupted text, and the multimodal encoder of VLP serves as a discriminator that predicts whether each text token in the corrupted text matches the image. At the instance level, for the Image-Text Matching task that judges whether an image-text pair is matched, we propose a novel bootstrapping method to generate hard negative text samples that differ from the positive ones only at the token level. In this way, we force the network to detect fine-grained differences between images and texts. Notably, with fewer parameters, VL-Match significantly outperforms the previous state of the art on all image-text retrieval tasks.
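To make the token-level task concrete, below is a minimal PyTorch sketch of one Vision-Language Replaced Token Detection step as the abstract describes it: a generator fills masked text positions, and a discriminator reads the image together with the corrupted text and predicts, per token, whether it matches the image. The toy encoders, dimensions, variable names, and the 15% mask rate are illustrative assumptions, not the authors' implementation.

```python
# Sketch of Vision-Language Replaced Token Detection (VL-RTD).
# Assumptions: toy Transformer encoders stand in for VL-Match's text and
# multimodal encoders; image features are random stand-ins for ViT patches.
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, hidden, seq_len, num_patches = 1000, 256, 16, 49

# Generator: a small text encoder with an MLM head that fills masked positions.
gen_embed = nn.Embedding(vocab_size, hidden)
gen_body = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(hidden, 4, batch_first=True), num_layers=2)
mlm_head = nn.Linear(hidden, vocab_size)

# Discriminator: a multimodal encoder over [image patches; text tokens] with a
# per-token binary head ("does this token match the image?").
disc_embed = nn.Embedding(vocab_size, hidden)
disc_body = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(hidden, 4, batch_first=True), num_layers=2)
match_head = nn.Linear(hidden, 1)

tokens = torch.randint(0, vocab_size, (2, seq_len))   # ground-truth caption ids
image_feats = torch.randn(2, num_patches, hidden)     # stand-in patch features

# 1) Mask ~15% of positions and let the generator sample replacement tokens.
mask = torch.rand(tokens.shape) < 0.15
logits = mlm_head(gen_body(gen_embed(tokens)))
sampled = torch.distributions.Categorical(logits=logits).sample()
corrupted = torch.where(mask, sampled, tokens)

# 2) Per-token label: 1 if the token is original (matches the image), else 0.
matches = (corrupted == tokens).float()

# 3) Discriminator jointly encodes image + corrupted text, predicts per token.
joint = torch.cat([image_feats, disc_embed(corrupted)], dim=1)
pred = match_head(disc_body(joint))[:, num_patches:, 0]  # text positions only
rtd_loss = F.binary_cross_entropy_with_logits(pred, matches)
print(rtd_loss.item())
```

The same corruption machinery also suggests how the instance-level bootstrapping could work: replacing only a few tokens of a positive caption yields a hard negative for Image-Text Matching that differs from the positive at the token level, which is the spirit of the method the abstract describes.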

Related Material

[pdf]
[bibtex]
@InProceedings{Bi_2023_ICCV,
    author    = {Bi, Junyu and Cheng, Daixuan and Yao, Ping and Pang, Bochen and Zhan, Yuefeng and Yang, Chuanguang and Wang, Yujing and Sun, Hao and Deng, Weiwei and Zhang, Qi},
    title     = {VL-Match: Enhancing Vision-Language Pretraining with Token-Level and Instance-Level Matching},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {2584-2593}
}