PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs

Michael Dorkenwald, Nimrod Barazani, Cees G. M. Snoek, Yuki M. Asano; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 13548-13558

Abstract


Vision-Language Models (VLMs), such as Flamingo and GPT-4V, have shown immense potential by integrating large language models with vision systems. Nevertheless, these models face challenges in the fundamental computer vision task of object localisation, due to their training on multimodal data containing mostly captions without explicit spatial grounding. While it is possible to construct custom, supervised training pipelines with bounding box annotations that integrate with VLMs, these result in specialized and hard-to-scale models. In this paper, we aim to explore the limits of caption-based VLMs and instead propose to tackle the challenge in a simpler manner by i) keeping the weights of a caption-based VLM frozen, and ii) not using any supervised detection data. To this end, we introduce an input-agnostic Positional Insert (PIN), a learnable spatial prompt containing a minimal set of parameters that is slid inside the frozen VLM, unlocking object localisation capabilities. Our PIN module is trained with a simple next-token prediction task on synthetic data without requiring the introduction of new output heads. Our experiments demonstrate strong zero-shot localisation performance on a variety of images, including Pascal VOC, COCO, LVIS, and diverse images like paintings or cartoons.
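
To make the abstract's idea concrete, below is a minimal, self-contained sketch in PyTorch (not the authors' released code): a single learnable, input-agnostic tensor is added to the visual tokens of a frozen VLM and optimised with a next-token prediction loss on synthetic localisation targets. ToyFrozenVLM and every name inside it are hypothetical stand-ins for a real caption-based VLM such as Flamingo; only the overall training pattern reflects the paper's description.

# Minimal sketch (assumed interfaces, not the authors' code): a learnable
# positional insert is slid into the features of a frozen VLM and trained
# with a next-token prediction objective. Only the PIN receives gradients.
import torch
import torch.nn as nn


class ToyFrozenVLM(nn.Module):
    """Hypothetical frozen caption-based VLM: a vision encoder plus a tiny language head."""

    def __init__(self, num_patches=16, embed_dim=64, vocab_size=1000):
        super().__init__()
        self.vision_encoder = nn.Linear(3 * 32 * 32, num_patches * embed_dim)
        self.language_head = nn.Linear(embed_dim, vocab_size)
        self.num_patches, self.embed_dim = num_patches, embed_dim
        for p in self.parameters():          # keep all VLM weights frozen
            p.requires_grad_(False)

    def encode_image(self, images):
        flat = images.flatten(1)
        return self.vision_encoder(flat).view(-1, self.num_patches, self.embed_dim)

    def next_token_loss(self, visual_tokens, target_tokens):
        # Predict the textual location token from the (PIN-augmented) features,
        # i.e. positions are emitted as ordinary text, with no new output head.
        logits = self.language_head(visual_tokens.mean(dim=1))
        return nn.functional.cross_entropy(logits, target_tokens)


class PositionalInsert(nn.Module):
    """The PIN: one learnable embedding per visual token, shared across all images."""

    def __init__(self, num_patches, embed_dim):
        super().__init__()
        self.pin = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

    def forward(self, visual_tokens):
        return visual_tokens + self.pin      # slide the insert into the frozen features


vlm = ToyFrozenVLM()
pin = PositionalInsert(vlm.num_patches, vlm.embed_dim)
optimizer = torch.optim.Adam(pin.parameters(), lr=1e-3)   # only PIN parameters are updated

# One synthetic training batch: images paired with tokens encoding object positions.
images = torch.randn(8, 3, 32, 32)
location_tokens = torch.randint(0, 1000, (8,))

visual = vlm.encode_image(images)            # frozen visual features
loss = vlm.next_token_loss(pin(visual), location_tokens)
optimizer.zero_grad()
loss.backward()                              # gradients flow only into the PIN
optimizer.step()
print(f"loss: {loss.item():.4f}")

The design choice mirrored here is that the optimiser only sees pin.parameters(), so the VLM weights stay untouched and localisation is learned purely through the standard captioning interface.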

Related Material


[pdf] [supp] [arXiv]
@InProceedings{Dorkenwald_2024_CVPR,
    author    = {Dorkenwald, Michael and Barazani, Nimrod and Snoek, Cees G. M. and Asano, Yuki M.},
    title     = {PIN: Positional Insert Unlocks Object Localisation Abilities in VLMs},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {13548-13558}
}