Pixel-Aligned Language Model

Xu, Jiarui; Zhou, Xingyi; Yan, Shen; Gu, Xiuye; Arnab, Anurag; Sun, Chen; Wang, Xiaolong; Schmid, Cordelia

Jiarui Xu, Xingyi Zhou, Shen Yan, Xiuye Gu, Anurag Arnab, Chen Sun, Xiaolong Wang, Cordelia Schmid; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 13030-13039

Abstract

Large language models have achieved great success in recent years, as have their variants in vision. Existing vision-language models can describe images in natural language, answer visual questions, or perform complex reasoning about an image. However, it remains unclear how localization tasks, such as word grounding or referring localization, can be performed using large language models. In this work, we aim to develop a vision-language model that can take locations, for example a set of points or boxes, as either inputs or outputs. When taking locations as inputs, the model performs location-conditioned captioning, generating captions for the indicated object or region. When generating locations as outputs, our model regresses pixel coordinates for each output word generated by the language model, and thus performs dense word grounding. Our model is pre-trained on the Localized Narratives dataset, which contains pixel-word-aligned captioning from human attention. We show that our model can be applied to various location-aware vision-language tasks, including referring localization, location-conditioned captioning, and dense object captioning, achieving state-of-the-art performance on RefCOCO and Visual Genome.
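
The idea of regressing one pixel coordinate per generated word can be illustrated with a minimal sketch. It is not the paper's implementation: the linear head, the sigmoid normalization, the feature dimension, the random weights, and every name below (`regress_word_locations`, `features`, `weights`) are assumptions made purely for illustration.

```python
# Illustrative sketch only: a per-word coordinate regression head.
# All names, shapes, and weights are hypothetical, not taken from the paper.
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def regress_word_locations(token_features, weights, bias, image_size):
    """Map each word's feature vector to one (x, y) pixel coordinate.

    token_features: list of feature vectors, one per generated word
    weights: 2 x D matrix of an assumed linear head; bias: 2 values
    image_size: (width, height) in pixels
    """
    width, height = image_size
    points = []
    for feat in token_features:
        # Linear projection to two values, squashed into (0, 1).
        u = sigmoid(sum(w * f for w, f in zip(weights[0], feat)) + bias[0])
        v = sigmoid(sum(w * f for w, f in zip(weights[1], feat)) + bias[1])
        # Scale the normalized coordinates to pixel coordinates.
        points.append((u * width, v * height))
    return points


random.seed(0)
words = ["a", "dog", "on", "grass"]
dim = 8  # assumed feature size, chosen only to keep the example small
features = [[random.gauss(0, 1) for _ in range(dim)] for _ in words]
weights = [[random.gauss(0, 0.5) for _ in range(dim)] for _ in range(2)]
bias = [0.0, 0.0]

# Pair every word with its predicted location (dense word grounding).
grounded = list(zip(words, regress_word_locations(features, weights, bias, (640, 480))))
for word, (x, y) in grounded:
    print(f"{word}: ({x:.1f}, {y:.1f})")
```

With random weights the printed locations are meaningless. The sketch only shows the output shape of dense word grounding: one coordinate pair per word, bounded by the image size.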

Related Material

[bibtex]

@InProceedings{Xu_2024_CVPR,
    author    = {Xu, Jiarui and Zhou, Xingyi and Yan, Shen and Gu, Xiuye and Arnab, Anurag and Sun, Chen and Wang, Xiaolong and Schmid, Cordelia},
    title     = {Pixel-Aligned Language Model},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {13030-13039}
}