Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs

Jonathan Roberts, Timo Lüddecke, Rehan Sheikh, Kai Han, Samuel Albanie; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 554-563


Multimodal large language models (MLLMs) have shown remarkable capabilities across a broad range of tasks, but their knowledge and abilities in the geographic and geospatial domains are yet to be explored, despite potential wide-ranging benefits to navigation, environmental research, and disaster response. We conduct a series of experiments exploring various vision capabilities of MLLMs within these domains, particularly focusing on the frontier model GPT-4V, and benchmark its performance against open-source counterparts. Our methodology involves challenging these models with a small-scale geographic benchmark consisting of a suite of visual tasks, testing their abilities across a spectrum of complexity. The analysis uncovers not only where such models excel, including instances where they outperform humans, but also where they falter, providing a balanced view of their capabilities in the geographic domain. To enable the comparison and evaluation of future models, we publicly release our benchmark.

Related Material

@InProceedings{Roberts_2024_CVPR,
    author    = {Roberts, Jonathan and L\"uddecke, Timo and Sheikh, Rehan and Han, Kai and Albanie, Samuel},
    title     = {Charting New Territories: Exploring the Geographic and Geospatial Capabilities of Multimodal LLMs},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {554-563}
}