Spatial Representations in Multimodal AI Systems

Scott O. Murray, Bridget Leonard; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 8256-8259

Abstract


This study details how spatial information is represented within a multimodal AI system (GPT-4-turbo, "GPT-4V"), leveraging established methodologies from human cognitive science research. Our investigation reveals rich underlying spatial comprehension but also uncovers notable limitations. We found that the structure of spatial representation in GPT-4V is predominantly propositional, diverging from the analog-like representations characteristic of human and animal spatial cognition. This discrepancy becomes particularly evident in tasks requiring spatial manipulation or perspective shifts, where GPT-4V falls short. Our analysis aims to bridge the gap between AI and human cognition, highlighting critical areas for future research and development in multimodal intelligence.

Related Material


@InProceedings{Murray_2024_CVPR,
    author    = {Murray, Scott O. and Leonard, Bridget},
    title     = {Spatial Representations in Multimodal AI Systems},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {8256-8259}
}