@InProceedings{Wu_2026_CVPR,
    author    = {Wu, Haoning and Huang, Xiao and Chen, Yaohui and Zhang, Ya and Wang, Yanfeng and Xie, Weidi},
    title     = {SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {31029-31041}
}
SpatialScore: Towards Comprehensive Evaluation for Spatial Intelligence
Abstract
Existing evaluations of multimodal large language models (MLLMs) on spatial intelligence are typically fragmented and limited in scope. In this work, we conduct a holistic assessment of the spatial understanding abilities of modern MLLMs and propose complementary data-driven and agent-based solutions. Concretely, we make the following contributions: (i) we propose SpatialScore, the most comprehensive and diverse multimodal spatial intelligence benchmark to date, encompassing various visual data types, input modalities, and question-answering formats, with approximately 5K manually verified samples across 30 distinct tasks; (ii) we construct SpatialCorpus, a large-scale training resource with 331K multimodal QA samples for supervised fine-tuning of Qwen3-VL on spatial understanding; (iii) we develop SpatialAgent, a multi-agent system that incorporates 12 specialized spatial perception tools and supports both Plan-Execute and ReAct reasoning paradigms, improving spatial reasoning in a training-free manner; and (iv) we conduct extensive evaluations on 40 representative MLLMs, revealing persistent challenges in spatial intelligence while demonstrating the effectiveness of our data-driven and agent-based solutions. All data, code, and models will be publicly available.

