3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark

Wufei Ma, Haoyu Chen, Guofeng Zhang, Yu-Cheng Chou, Jieneng Chen, Celso de Melo, Alan Yuille; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 6924-6934

Abstract


3D spatial reasoning is the ability to analyze and interpret the positions, orientations, and spatial relationships of objects in 3D space. It allows models to develop a comprehensive understanding of a 3D scene, which broadens their applicability to areas such as autonomous navigation, robotics, and AR/VR. Despite the remarkable improvements achieved by large multi-modal models (LMMs) in a wide range of image and video understanding tasks, their 3D spatial reasoning abilities remain less studied. In this work, we present the first comprehensive 3D spatial reasoning benchmark, 3DSRBench, with 3,000 annotated image-question-answer triplets spanning 12 question types. We balance the data distribution by collecting complementary images that lead to opposite answers given the same question. We also adopt a novel evaluation strategy, FlipEval, for robust evaluation of 3D spatial reasoning capabilities. Moreover, to study the robustness of 3D spatial reasoning with respect to 3D camera viewpoints, 3DSRBench includes two subsets with 3D spatial reasoning questions on images of the same scene taken from common and uncommon viewpoints. We benchmark a wide range of open-source and proprietary LMMs, revealing their limitations in different types of 3D awareness, i.e., height, orientation, location, and multi-object reasoning. 3DSRBench also allows us to study design choices in developing LMMs with strong 3D reasoning capabilities, such as the vision encoders, connectors, and training recipes.
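
The balancing idea in the abstract (pairing each question with complementary images that lead to opposite answers) can be illustrated with a small sketch. This is not the benchmark's code or data format: the `Sample` class, the `accuracy` helper, and the toy questions and image IDs below are all hypothetical, chosen only to show why a predictor that ignores the image cannot exceed chance on such pairs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Sample:
    """One image-question-answer triplet (hypothetical layout, not the real schema)."""
    image_id: str
    question: str
    answer: str


def accuracy(predict: Callable[[Sample], str], samples: List[Sample]) -> float:
    """Fraction of samples for which the predictor returns the reference answer."""
    correct = sum(predict(s) == s.answer for s in samples)
    return correct / len(samples)


# Toy data, invented for illustration: every question appears twice,
# once per complementary image, with opposite reference answers.
samples = [
    Sample("scene1_a", "Is the mug higher than the lamp?", "yes"),
    Sample("scene1_b", "Is the mug higher than the lamp?", "no"),
    Sample("scene2_a", "Is the car facing the camera?", "no"),
    Sample("scene2_b", "Is the car facing the camera?", "yes"),
]


def always_yes(sample: Sample) -> str:
    """A constant-answer baseline that never looks at the image."""
    return "yes"


def question_only(sample: Sample) -> str:
    """A baseline that answers from the question text alone."""
    return "yes" if "higher" in sample.question else "no"


always_yes_acc = accuracy(always_yes, samples)
question_only_acc = accuracy(question_only, samples)
print(always_yes_acc, question_only_acc)
```

Because each question occurs once with each answer, any predictor that depends only on the question text is right on exactly one image of every pair, so language priors alone cannot push accuracy on these binary pairs above one half.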

Related Material


[bibtex]
@InProceedings{Ma_2025_ICCV,
    author    = {Ma, Wufei and Chen, Haoyu and Zhang, Guofeng and Chou, Yu-Cheng and Chen, Jieneng and de Melo, Celso and Yuille, Alan},
    title     = {3DSRBench: A Comprehensive 3D Spatial Reasoning Benchmark},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {6924-6934}
}