VisionCube: 3D-Aware Vision-Language Model for Multi-Step Spatial Reasoning

Feiyang Wang, Nan Luo, Wangyu Wu; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025, pp. 3270-3279

Abstract


Solving Rubik's Cube efficiently requires advanced spatial reasoning, sequential planning, and adaptive decision-making. Traditional solvers rely on predefined algorithms and hand-crafted heuristics, limiting their generalizability across diverse cube states. In this work, we introduce VisionCube, a multimodal embodied AI system designed for Rubik's Cube solving. VisionCube incorporates multi-view spatial reasoning, geometric priors, and cross-modal fusion to enhance its understanding of 3D cube transformations. To support this, we construct CubeCoT, a dataset containing annotated Rubik's Cube states and structured multi-step solving trajectories at three difficulty levels. VisionCube employs a Dual-Loop VisionCoT framework for iterative reasoning and a Memory Stream to improve long-horizon planning. We integrate 3D feature extraction via Instant-NGP, PointNet, and Point Transformer, ensuring robust spatial perception. Our model achieves 100% accuracy on low- and medium-difficulty tasks and 80% on high-difficulty tasks, significantly outperforming MiniGPT-4 and LLaVA by 35--60% in accuracy on complex multi-step planning.

Related Material


[bibtex]
@InProceedings{Wang_2025_CVPR,
  author    = {Wang, Feiyang and Luo, Nan and Wu, Wangyu},
  title     = {VisionCube: 3D-Aware Vision-Language Model for Multi-Step Spatial Reasoning},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {June},
  year      = {2025},
  pages     = {3270-3279}
}