@InProceedings{Li_2026_CVPR,
    author    = {Li, Zongzhao and Ma, Zongyang and Li, Mingze and Li, Songyou and Rong, Yu and Xu, Tingyang and Zhang, Ziqi and Zhao, Deli and Huang, Wenbing},
    title     = {STAR-R1: Multi-View Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {12041-12051}
}
STAR-R1: Multi-View Spatial TrAnsformation Reasoning by Reinforcing Multimodal LLMs
Abstract
Multimodal Large Language Models (MLLMs) remain far from human-level performance in multi-view spatial reasoning, where models must establish object correspondences across views and infer coherent scene semantics. We analyze this limitation through the Transformation-Driven Visual Reasoning (TVR) task and find that Supervised Fine-Tuning (SFT) fails to capture cross-view consistency, whereas reinforcement learning (RL) fails to reliably identify key referential objects. To bridge this gap, we introduce multi-View Spatial TrAnsformation Reasoning (STAR-R1), a two-stage framework that combines process-supervised SFT with a referential-aware RL paradigm. STAR-R1 first learns structured spatial reasoning trajectories from high-quality CoTs and then uses fine-grained rewards on referential selection and answer correctness to encourage effective exploration and robust scene interpretation. Despite using only a small amount of high-quality training data, STAR-R1 surpasses state-of-the-art models trained on far more data on the multi-view spatial understanding benchmarks TVR, MMSI-Bench, MindCube-Bench, and SPAR-Bench. Our study reveals the overlooked potential of RL in multi-view spatial understanding and points a way toward more human-like spatial reasoning in MLLMs.

