Mamba-3VL: Taming State Space Model for 3D Vision Language Learning

Wang, Yuan; Chen, Yuxin; Qi, Zhongang; Liu, Lijun; Jiao, Jile; Feng, Xuetao; Liang, Yujia; Shan, Ying; Zhang, Zhipeng

Yuan Wang, Yuxin Chen, Zhongang Qi, Lijun Liu, Jile Jiao, Xuetao Feng, Yujia Liang, Ying Shan, Zhipeng Zhang; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 6273-6283

Abstract

3D vision-language (3D-VL) reasoning, connecting natural language with 3D physical world, represents a milestone in advancing spatial intelligence. While transformer-based methods dominate 3D-VL research, their quadratic complexity and simplistic positional embedding mechanisms severely limits effective modeling of long-range 3D-VL dependencies and spatial relationships in 3D-VL tasks. State Space Models (SSM) have emerged as promising linear-complexity alternatives for sequential data processing, while inherent selection mechanism offers notable capability for spatial modeling. Despite its potential, straightforward adoption of Mamba to 3D-VL tasks encounters two obstacles: (1) how to perceive the position of 3D objects and understand complex spatial relationships, and (2) how to achieve thorough synergies of multi-modal features. In this paper, we propose Mamba-3VL, a pioneering 3D-VL framework to model complex intra- and inter-modality correlations and enhance spatial relation reasoning, while guaranteeing top-tier performance, high efficiency, and generalization potential for 3D-VL tasks. Specifically, Mamba Mixer explicitly models 3D-VL interaction via channel twisting and relation-prioritized spatial scanning policy. It maximally retain spatial relation of object-centric features. To further provide precise spatial encoding for mamba, we develop Instance-aware Dynamic Position Adapter (IDPA) to dynamically adjust instance-specific positional embeddings and enhance local spatial relation of 3D objects. Extensive results validate Mamba-3VL trumps other competitors on seven 3D-VL benchmarks and showcases versatile potentials for challenging Embodied AI tasks.

Related Material

[pdf]

[bibtex]

@InProceedings{Wang_2025_ICCV, author = {Wang, Yuan and Chen, Yuxin and Qi, Zhongang and Liu, Lijun and Jiao, Jile and Feng, Xuetao and Liang, Yujia and Shan, Ying and Zhang, Zhipeng}, title = {Mamba-3VL: Taming State Space Model for 3D Vision Language Learning}, booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)}, month = {October}, year = {2025}, pages = {6273-6283} }