MIVC: Multiple Instance Visual Component for Visual-Language Models

Wenyi Wu, Qi Li, Wenliang Zhong, Junzhou Huang; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 8117-8126

Abstract


Vision-language models have been widely explored across a wide range of tasks and achieve satisfactory performance. However, it remains under-explored how to consolidate entity understanding from a varying number of images and align it with pre-trained language models for generative tasks. In this paper, we propose MIVC, a general multiple instance visual component that bridges the gap between variable-count image inputs and off-the-shelf vision-language models by aggregating visual representations in a permutation-invariant fashion through a neural network. We show that MIVC can be plugged into visual-language models to consistently improve performance on visual question answering, classification and captioning tasks on a publicly available e-commerce dataset with multiple images per product. Furthermore, we show that the component provides insight into the contribution of each image to the downstream tasks.
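The abstract describes aggregating a variable number of per-image embeddings into a single representation in a permutation-invariant way. The paper's exact aggregation network is not specified on this page, but attention-based multiple-instance pooling is one standard permutation-invariant choice; the sketch below (all parameter names and shapes are illustrative, not the authors' implementation) shows such a pooling and checks that reordering the images leaves the pooled vector unchanged:

```python
import numpy as np

def attention_pool(H, V, w):
    """Permutation-invariant attention pooling over a bag of
    instance embeddings H (n x d). V (d x k) and w (k,) would be
    learned parameters; here they are fixed for illustration."""
    scores = np.tanh(H @ V) @ w           # one scalar score per image, shape (n,)
    a = np.exp(scores - scores.max())
    a = a / a.sum()                       # softmax attention weights, sum to 1
    return a @ H, a                       # pooled embedding (d,), weights (n,)

rng = np.random.default_rng(0)
H = rng.normal(size=(4, 8))   # 4 images, each an 8-dim visual embedding
V = rng.normal(size=(8, 3))
w = rng.normal(size=3)

pooled, weights = attention_pool(H, V, w)
pooled_perm, _ = attention_pool(H[rng.permutation(4)], V, w)
assert np.allclose(pooled, pooled_perm)   # image order does not matter
```

The attention weights also give a per-image contribution score, which matches the abstract's claim that the component reveals how much each image contributes to the downstream task.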

Related Material


[bibtex]
@InProceedings{Wu_2024_WACV,
    author    = {Wu, Wenyi and Li, Qi and Zhong, Wenliang and Huang, Junzhou},
    title     = {MIVC: Multiple Instance Visual Component for Visual-Language Models},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2024},
    pages     = {8117-8126}
}