Height-Fidelity Dense Global Fusion for Multi-modal 3D Object Detection

Wang, Hanshi; Gao, Jin; Hu, Weiming; Zhang, Zhipeng

Hanshi Wang, Jin Gao, Weiming Hu, Zhipeng Zhang; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 26664-26674

Abstract

We present the first work demonstrating that a pure Mamba block can achieve efficient Dense Global Fusion while delivering top performance for camera-LiDAR multi-modal 3D object detection. Our motivation stems from the observation that existing fusion strategies cannot simultaneously achieve efficiency, model long-range dependencies, and retain complete scene information. Inspired by recent advances in state-space models (SSMs) and linear attention, we leverage their linear complexity and long-range modeling capabilities to address these challenges. This is non-trivial, however: our experiments reveal that simply adopting efficient linear-complexity methods does not necessarily yield improvements and may even degrade performance. We attribute this degradation to the loss of height information during multi-modal alignment, which leads to deviations in sequence order. To resolve this, we propose height-fidelity LiDAR encoding, which preserves precise height information through voxel compression in continuous space and thereby enhances camera-LiDAR alignment. We then introduce the Hybrid Mamba Block, which leverages the enriched height-informed features to conduct local and global contextual learning. By integrating these components, our method achieves state-of-the-art performance with a top-tier NDS of 75.0 on the nuScenes validation benchmark, surpassing even methods that use high-resolution inputs. Our method also remains efficient, running faster at inference than most recent state-of-the-art methods. Code is available at https://github.com/AutoLab-SAI-SJTU/MambaFusion

Related Material

[bibtex]

@InProceedings{Wang_2025_ICCV,
    author    = {Wang, Hanshi and Gao, Jin and Hu, Weiming and Zhang, Zhipeng},
    title     = {Height-Fidelity Dense Global Fusion for Multi-modal 3D Object Detection},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {26664-26674}
}