Vision language models are blind

Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, Anh Totti Nguyen; Proceedings of the Asian Conference on Computer Vision (ACCV), 2024, pp. 18-34

Abstract


Large language models (LLMs) with vision capabilities (e.g., GPT-4o, Gemini 1.5, and Claude 3) are powering countless image-text processing applications, enabling unprecedented multimodal human-machine interaction. Yet, we find that all state-of-the-art LLMs fail on absurdly simple tasks such as identifying (a) whether two circles overlap or whether two lines touch each other; (b) which letter is being circled in a word; and (c) counting the number of circles in an Olympic-like logo. Our findings suggest the tokenization of input images to LLMs is the source of the problem, causing failures in real-world scenarios, such as determining if two streets intersect on a Manhattan map, identifying a stock price crossing a threshold line, and describing content within a bounding box in an image.

Related Material


@InProceedings{Rahmanzadehgervi_2024_ACCV,
  author    = {Rahmanzadehgervi, Pooyan and Bolton, Logan and Taesiri, Mohammad Reza and Nguyen, Anh Totti},
  title     = {Vision language models are blind},
  booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
  month     = {December},
  year      = {2024},
  pages     = {18-34}
}