@InProceedings{Rahmanzadehgervi_2024_ACCV,
  author    = {Rahmanzadehgervi, Pooyan and Bolton, Logan and Taesiri, Mohammad Reza and Nguyen, Anh Totti},
  title     = {Vision language models are blind},
  booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
  month     = {December},
  year      = {2024},
  pages     = {18-34}
}
Vision language models are blind
Abstract
Large language models (LLMs) with vision capabilities (e.g., GPT-4o, Gemini 1.5, and Claude 3) are powering countless image-text processing applications, enabling unprecedented multimodal, human-machine interaction. Yet, we find that all state-of-the-art LLMs fail on absurdly simple tasks such as identifying (a) whether two circles overlap or whether two lines touch each other; (b) which letter is being circled in a word; and (c) the number of circles in an Olympic-like logo. Our findings suggest that the tokenization of input images for LLMs is the source of the problem, causing failures in real-world scenarios, such as determining if two streets intersect on a Manhattan map, identifying a stock price crossing a threshold line, and describing content within a bounding box in an image.
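The circle-overlap probe the abstract mentions reduces to elementary geometry: two circles overlap or touch when the distance between their centers is at most the sum of their radii. A minimal sketch of that ground-truth check (the function name and signature are illustrative, not from the paper):

```python
import math

def circles_overlap(c1, r1, c2, r2):
    """True if two circles (center tuple, radius) overlap or touch.

    Illustrative ground-truth oracle, not code from the paper.
    """
    dx, dy = c1[0] - c2[0], c1[1] - c2[1]
    # Compare center distance against the sum of the radii.
    return math.hypot(dx, dy) <= r1 + r2

# Two unit circles with centers 1.5 apart overlap; 3.0 apart do not.
print(circles_overlap((0, 0), 1.0, (1.5, 0), 1.0))
print(circles_overlap((0, 0), 1.0, (3.0, 0), 1.0))
```

A rule this simple is what makes the benchmark striking: the correct answer is computable in one line, yet the paper reports that state-of-the-art vision-language models frequently get it wrong.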