SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities

Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, Fei Xia; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 14455-14465

Abstract


Understanding and reasoning about spatial relationships is crucial for Visual Question Answering (VQA) and robotics. Vision-Language Models (VLMs) have shown impressive performance on certain VQA benchmarks but struggle with 3D spatial reasoning tasks, such as recognizing distances or size differences between physical objects. This limitation may stem from a lack of 3D spatial knowledge in their training data. To address this, we propose training VLMs with Internet-scale spatial reasoning data. Our approach includes developing an automatic 3D spatial VQA data generation framework capable of creating 2 billion VQA examples from 10 million real-world images. We explore various factors in the training process, such as data quality, training pipeline, and VLM architecture. Our work introduces the first Internet-scale 3D spatial reasoning dataset in metric space. By co-training a VLM with this dataset, we significantly improve its performance in both qualitative and quantitative spatial VQA. Additionally, this enhanced VLM enables new applications in chain-of-thought spatial reasoning and robotics, particularly in quantitative estimation.
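The abstract describes the data generation framework only at a high level. Purely as an illustration (not the authors' implementation), the sketch below shows how template-based spatial QA pairs could be synthesized once detected objects have been lifted to metric 3D coordinates; all object names, coordinates, and templates here are hypothetical stand-ins, whereas the paper's pipeline derives such quantities from real images.

# Illustrative sketch only: template-based spatial QA generation from
# hypothetical metric 3D object centroids (meters, camera frame, +x right).
import math
import random

# Stand-in annotations for one image; a real pipeline would obtain these
# from object detection plus metric depth estimation.
objects = {
    "coffee mug": (0.42, 0.10, 1.35),
    "laptop": (-0.05, 0.12, 1.60),
}

def distance_qa(name_a, pos_a, name_b, pos_b):
    # Quantitative template: metric distance between two object centroids.
    dist = math.dist(pos_a, pos_b)
    return {
        "question": f"How far apart are the {name_a} and the {name_b}?",
        "answer": f"The {name_a} is roughly {dist:.2f} meters from the {name_b}.",
    }

def left_right_qa(name_a, pos_a, name_b, pos_b):
    # Qualitative template: left/right ordering along the camera x-axis.
    side = "left" if pos_a[0] < pos_b[0] else "right"
    return {
        "question": f"Is the {name_a} to the left or to the right of the {name_b}?",
        "answer": f"The {name_a} is to the {side} of the {name_b}.",
    }

# Sample an object pair and emit one QA example per template.
name_a, name_b = random.sample(list(objects), 2)
for template in (distance_qa, left_right_qa):
    print(template(name_a, objects[name_a], name_b, objects[name_b]))

Applied over millions of images with many such templates, this style of procedure would yield the kind of qualitative and quantitative spatial QA pairs the abstract refers to.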

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Chen_2024_CVPR,
    author    = {Chen, Boyuan and Xu, Zhuo and Kirmani, Sean and Ichter, Brian and Sadigh, Dorsa and Guibas, Leonidas and Xia, Fei},
    title     = {SpatialVLM: Endowing Vision-Language Models with Spatial Reasoning Capabilities},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2024},
    pages     = {14455-14465}
}