Good at Captioning, Bad at Counting: Benchmarking GPT-4V on Earth Observation Data
Abstract
Large Vision-Language Models (VLMs) have demonstrated impressive performance on complex tasks involving visual input with natural language instructions. However, it remains unclear to what extent capabilities on natural images transfer to Earth observation (EO) data, which are predominantly satellite and aerial images less common in VLM training data. In this work, we propose a comprehensive benchmark to gauge the progress of VLMs toward being useful tools for EO data by assessing their abilities on scene understanding, localization and counting, and change detection. Motivated by real-world applications, our benchmark includes scenarios like urban monitoring, disaster relief, land use, and conservation. We discover that, although state-of-the-art VLMs like GPT-4V possess world knowledge that leads to strong performance on location understanding and image captioning, their poor spatial reasoning limits usefulness on object localization and counting. Our benchmark, leaderboard, and evaluation suite are available at https://vleo.danielz.ch/. A full version of this paper is available at https://arxiv.org/abs/2401.17600.
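To make the counting setting concrete, below is a minimal sketch of how one might pose such a task to a GPT-4V-class model through the OpenAI chat completions API. The model name, prompt wording, and image URL are illustrative assumptions, not the paper's exact evaluation protocol; the authors' actual implementation lives in the evaluation suite linked above.

# Minimal sketch: asking a multimodal VLM to count objects in an aerial image.
# Assumptions: "gpt-4o" stands in for any GPT-4V-class model, the image URL is
# a placeholder, and the prompt is illustrative rather than the benchmark's own.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "You are analyzing an aerial image. "
                        "Count the airplanes visible in the image and "
                        "answer with a single integer."
                    ),
                },
                {
                    "type": "image_url",
                    # Placeholder URL; substitute a real satellite/aerial image.
                    "image_url": {"url": "https://example.com/aerial_scene.png"},
                },
            ],
        }
    ],
    max_tokens=10,
)
print(response.choices[0].message.content)

Evaluating such responses against ground-truth object counts is what exposes the spatial-reasoning gap the abstract describes: the same model that captions the scene fluently often returns an inaccurate count.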
Related Material

BibTeX:
@InProceedings{Zhang_2024_CVPR,
    author    = {Zhang, Chenhui and Wang, Sherrie},
    title     = {Good at Captioning, Bad at Counting: Benchmarking GPT-4V on Earth Observation Data},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {7839-7849}
}