COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval

Haoyu Lu, Nanyi Fei, Yuqi Huo, Yizhao Gao, Zhiwu Lu, Ji-Rong Wen; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 15692-15701

Abstract


Large-scale single-stream pre-training has shown dramatic performance in image-text retrieval. Regrettably, it suffers from low inference efficiency due to heavy attention layers. Recently, two-stream methods such as CLIP and ALIGN with high inference efficiency have also shown promising performance; however, they only consider instance-level alignment between the two streams, so there is still room for improvement. To overcome these limitations, we propose a novel COllaborative Two-Stream vision-language pre-training model, termed COTS, for image-text retrieval by enhancing cross-modal interaction. In addition to instance-level alignment via momentum contrastive learning, we leverage two extra levels of cross-modal interaction in our COTS: (1) Token-level interaction -- a masked vision-language modeling (MVLM) learning objective is devised without using a cross-stream network module, where a variational autoencoder is imposed on the visual encoder to generate visual tokens for each image. (2) Task-level interaction -- a KL-alignment learning objective is devised between the text-to-image and image-to-text retrieval tasks, where the probability distribution of each task is computed with the negative queues in momentum contrastive learning. Under a fair comparison setting, our COTS achieves the highest performance among all two-stream methods and comparable performance to the latest single-stream methods while being 10,800x faster in inference. Importantly, our COTS is also applicable to text-to-video retrieval, yielding a new state of the art on the widely used MSR-VTT dataset.
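
The following is a minimal PyTorch-style sketch, not the authors' released implementation, of the two objectives the abstract describes at the instance and task levels: an InfoNCE-style momentum contrastive loss computed against negative queues, and a symmetric KL alignment between the image-to-text and text-to-image probability distributions. All names (img_feat, txt_feat, img_queue, txt_queue, temperature) and the assumption that the two momentum queues store features of the same past image-text pairs (so their k-th entries correspond) are illustrative assumptions, not details taken from the paper.

    import torch
    import torch.nn.functional as F

    def instance_and_task_losses(img_feat, txt_feat, img_queue, txt_queue, temperature=0.07):
        """img_feat, txt_feat: (B, D) L2-normalized features from the two encoders.
        img_queue, txt_queue: (K, D) L2-normalized momentum features of past batches,
        assumed to be index-aligned (entry k in both queues comes from the same pair)."""
        # Positive logits: similarity of each matched image-text pair, shape (B, 1).
        pos = (img_feat * txt_feat).sum(dim=1, keepdim=True)

        # Negative logits against the momentum queues, shape (B, K).
        i2t_neg = img_feat @ txt_queue.t()   # image query vs. queued text negatives
        t2i_neg = txt_feat @ img_queue.t()   # text query vs. queued image negatives

        # Instance-level alignment: InfoNCE in both retrieval directions,
        # with the positive pair at index 0 of each logit row.
        i2t_logits = torch.cat([pos, i2t_neg], dim=1) / temperature
        t2i_logits = torch.cat([pos, t2i_neg], dim=1) / temperature
        labels = torch.zeros(img_feat.size(0), dtype=torch.long, device=img_feat.device)
        loss_instance = F.cross_entropy(i2t_logits, labels) + F.cross_entropy(t2i_logits, labels)

        # Task-level interaction: align the two retrieval tasks' probability
        # distributions (over positive + queued negatives) with a symmetric KL divergence.
        p_i2t = F.softmax(i2t_logits, dim=1)
        p_t2i = F.softmax(t2i_logits, dim=1)
        loss_kl = 0.5 * (
            F.kl_div(p_i2t.log(), p_t2i, reduction="batchmean")
            + F.kl_div(p_t2i.log(), p_i2t, reduction="batchmean")
        )
        return loss_instance, loss_kl

In this sketch the two losses would simply be summed (possibly with weights) with the token-level MVLM loss during pre-training; the MVLM objective is omitted here because it depends on the visual tokenizer and masking scheme described in the paper.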

Related Material


@InProceedings{Lu_2022_CVPR,
    author    = {Lu, Haoyu and Fei, Nanyi and Huo, Yuqi and Gao, Yizhao and Lu, Zhiwu and Wen, Ji-Rong},
    title     = {COTS: Collaborative Two-Stream Vision-Language Pre-Training Model for Cross-Modal Retrieval},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {15692-15701}
}