From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs

Le Zhang, Jihan Yang, Soundarya Krishnan, Jimit Majmudar, Xiou Ge, Prasoon Puri, Prathamesh Saraf, Shruti Bhargava, Dhivya Piraviperumal, Yinan Ling, Cindy Pan, Hong Yu, Aishwarya Agrawal, Bo-Hsiang Tseng; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 12052-12063

Abstract

Human-level agentic intelligence extends beyond low-level geometric perception, evolving from recognizing where things are to understanding what they are for. While existing benchmarks effectively evaluate the geometric perception capabilities of multimodal large language models (MLLMs), they fall short of probing the higher-order cognitive abilities required for grounded intelligence. To address this gap, we introduce the Spatial-Functional Intelligence Benchmark (SFI-Bench), a video-based benchmark with over 1,500 expert-annotated questions derived from diverse egocentric indoor video scans. SFI-Bench systematically evaluates two complementary dimensions of advanced reasoning: (1) Structured Spatial Reasoning, which requires understanding complex layouts and forming coherent spatial representations, and (2) Functional Reasoning, which involves inferring object affordances and their context-dependent utility. The benchmark includes tasks such as conditional counting, multi-hop relational reasoning, functional pairing, and knowledge-grounded troubleshooting, directly challenging models to integrate perception, memory, and inference. Our experiments reveal that current MLLMs consistently struggle to combine spatial memory with functional reasoning and external knowledge, highlighting a critical bottleneck in achieving grounded intelligence. SFI-Bench therefore provides a diagnostic tool for measuring progress toward more cognitively capable and truly grounded multimodal agents.

Related Material

[bibtex]

@InProceedings{Zhang_2026_CVPR,
    author    = {Zhang, Le and Yang, Jihan and Krishnan, Soundarya and Majmudar, Jimit and Ge, Xiou and Puri, Prasoon and Saraf, Prathamesh and Bhargava, Shruti and Piraviperumal, Dhivya and Ling, Yinan and Pan, Cindy and Yu, Hong and Agrawal, Aishwarya and Tseng, Bo-Hsiang},
    title     = {From Where Things Are to What They Are For: Benchmarking Spatial-Functional Intelligence in Multimodal LLMs},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {12052-12063}
}