MonoDSSMs: Efficient Monocular 3D Object Detection with Depth-Aware State Space Models

Kiet Dang Vu, Trung Thai Tran, Duc Dung Nguyen; Proceedings of the Asian Conference on Computer Vision (ACCV), 2024, pp. 3883-3900

Abstract


Monocular 3D object detection is a key component of autonomous driving support systems, and recent years have seen substantial improvements in both detection quality and runtime performance. This work presents MonoDSSMs, the first approach to utilize the Mamba architecture to push runtime efficiency further while maintaining detection quality. In short, our contributions are: (1) we introduce a Mamba-based encoder-decoder architecture to extract 3D features, and (2) we propose a novel Cross-Mamba module that fuses depth-aware and context-aware features using State Space Models (SSMs). In addition, we employ a multi-scale feature prediction strategy to enhance the quality of the predicted depth map. Our experiments demonstrate that the proposed architecture yields competitive performance on the KITTI dataset while significantly improving efficiency in both model size and computational cost. MonoDSSMs achieves detection quality comparable to the baseline with 2.2x fewer parameters and 1.28x faster computation.
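The abstract does not specify the Cross-Mamba module's internals. As a rough, hypothetical sketch of the underlying idea, the example below implements a minimal (non-optimized) selective state-space scan, then derives the scan's input-dependent parameters from one feature stream (depth-aware) while feeding the other stream (context-aware) through the recurrence; all function and variable names here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def selective_scan(x, dt, A, B, C):
    """Minimal selective SSM scan (toy illustration, not the paper's code).
    x:  (L, D) input sequence          dt: (L, D) input-dependent step sizes
    A:  (D, N) state transition        B:  (L, N) per-step input projection
    C:  (L, N) per-step output projection
    Returns y: (L, D).
    """
    L, D = x.shape
    N = A.shape[1]
    h = np.zeros((D, N))                      # hidden state
    ys = []
    for t in range(L):
        dA = np.exp(dt[t][:, None] * A)       # discretized transition, (D, N)
        dB = dt[t][:, None] * B[t][None, :]   # discretized input map, (D, N)
        h = dA * h + dB * x[t][:, None]       # linear recurrence
        ys.append(h @ C[t])                   # project state to output, (D,)
    return np.stack(ys)

def cross_fusion(context, depth, A, Wb, Wc, Wdt):
    """Hypothetical cross-stream fusion: the scan parameters (B, C, dt)
    are generated from the depth-aware stream, while the context-aware
    stream is the sequence being scanned."""
    B = depth @ Wb                            # (L, N)
    C = depth @ Wc                            # (L, N)
    dt = np.log1p(np.exp(depth @ Wdt))        # softplus -> positive steps, (L, D)
    return selective_scan(context, dt, A, B, C)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    L, D, N = 6, 4, 8                         # sequence length, channels, state size
    context = rng.normal(size=(L, D))
    depth = rng.normal(size=(L, D))
    A = -np.abs(rng.normal(size=(D, N)))      # negative for a stable recurrence
    Wb, Wc = rng.normal(size=(D, N)), rng.normal(size=(D, N))
    Wdt = rng.normal(size=(D, D))
    fused = cross_fusion(context, depth, A, Wb, Wc, Wdt)
    print(fused.shape)                        # (6, 4)
```

In a real Mamba block the scan is computed with a hardware-efficient parallel kernel and the transition matrix is parameterized in log space; this sequential loop only conveys how one stream can modulate the state-space dynamics applied to the other.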

Related Material


[pdf]
[bibtex]
@InProceedings{Vu_2024_ACCV,
    author    = {Vu, Kiet Dang and Tran, Trung Thai and Nguyen, Duc Dung},
    title     = {MonoDSSMs: Efficient Monocular 3D Object Detection with Depth-Aware State Space Models},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2024},
    pages     = {3883-3900}
}