Towards Better Vision-Inspired Vision-Language Models

Cao, Yun-Hao; Ji, Kaixiang; Huang, Ziyuan; Zheng, Chuanyang; Liu, Jiajia; Wang, Jian; Chen, Jingdong; Yang, Ming

Yun-Hao Cao, Kaixiang Ji, Ziyuan Huang, Chuanyang Zheng, Jiajia Liu, Jian Wang, Jingdong Chen, Ming Yang; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 13537-13547

Abstract

Vision-language (VL) models have achieved unprecedented success recently in which the connection module is the key to bridge the modality gap. Nevertheless the abundant visual clues are not sufficiently exploited in most existing methods. On the vision side most existing approaches only use the last feature of the vision tower without using the low-level features. On the language side most existing methods only introduce shallow vision-language interactions. In this paper we present a vision-inspired vision-language connection module dubbed as VIVL which efficiently exploits the vision cue for VL models. To take advantage of the lowerlevel information from the vision tower a feature pyramid extractor (FPE) is introduced to combine features from different intermediate layers which enriches the visual cue with negligible parameters and computation overhead. To enhance VL interactions we propose deep vision-conditioned prompts (DVCP) that allows deep interactions of vision and language features efficiently. Our VIVL exceeds the previous state-of-the-art method by 18.1 CIDEr when training from scratch on the COCO caption task which greatly improves the data efficiency. When used as a plug-in module VIVL consistently improves the performance for various backbones and VL frameworks delivering new state-of-the-art results on multiple benchmarks e.g. NoCaps and VQAv2.

Related Material

[pdf]

[bibtex]

@InProceedings{Cao_2024_CVPR, author = {Cao, Yun-Hao and Ji, Kaixiang and Huang, Ziyuan and Zheng, Chuanyang and Liu, Jiajia and Wang, Jian and Chen, Jingdong and Yang, Ming}, title = {Towards Better Vision-Inspired Vision-Language Models}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {13537-13547} }