@InProceedings{Roberts_2024_CVPR,
    author    = {Roberts, Jonathan and L\"uddecke, Timo and Sheikh, Rehan and Han, Kai and Albanie, Samuel},
    title     = {Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {554-563}
}
Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs
Abstract
Multimodal large language models (MLLMs) have shown remarkable capabilities across a broad range of tasks, but their knowledge and abilities in the geographic and geospatial domains are yet to be explored, despite potential wide-ranging benefits to navigation, environmental research, and disaster response. We conduct a series of experiments exploring various vision capabilities of MLLMs within these domains, focusing in particular on the frontier model GPT-4V, and benchmark its performance against open-source counterparts. Our methodology involves challenging these models with a small-scale geographic benchmark consisting of a suite of visual tasks that test their abilities across a spectrum of complexity. The analysis uncovers not only where such models excel, including instances where they outperform humans, but also where they falter, providing a balanced view of their capabilities in the geographic domain. To enable the comparison and evaluation of future models, we publicly release our benchmark.