DSTR: Dual Scenes Transformer for Cross-Modal Fusion in 3D Object Detection
Haojie Cai, Dongfu Yin, Fei Richard Yu, Siting Xiong; Proceedings of the Winter Conference on Applications of Computer Vision (WACV), 2025, pp. 3064-3073
Abstract
Transformer-based fusion of LiDAR point clouds and multi-view images, in which one modality supplements the other, has garnered increasing attention in 3D object detection. However, most current methods fuse data over the entire scene, which entails substantial redundant background information and lacks fine-grained local details of the foreground objects to be detected. Furthermore, global scene fusion results in coarse fusion granularity, and the excessive redundancy leads to slow convergence and reduced accuracy. In this work, a novel Dual Scenes Transformer pipeline (DSTR), which comprises a Global-Scene Integration (GSI) module, a Local-Scene Integration (LSI) module, and a Dual Scenes Fusion (DSF) module, is presented to tackle these challenges. Concretely, GSI gathers global scene information from point-cloud and image features. LSI addresses the shortcomings of global scene fusion by extracting local instance features from both modalities, supplementing GSI at a finer granularity. Finally, DSF is proposed to aggregate the local scene into the global scene, fully exploiting the dual-modal information. Experiments on the nuScenes dataset show that DSTR achieves state-of-the-art (SOTA) performance in certain 3D object detection benchmark categories on both the validation and test sets.
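The abstract gives no implementation details, so the following is a minimal PyTorch sketch of one plausible reading of the DSF step: global scene tokens (GSI output) query per-object local instance tokens (LSI output) via cross-attention. The class name DualSceneFusionSketch, the token shapes, and the single-layer residual design are illustrative assumptions, not the authors' code.

import torch
import torch.nn as nn


class DualSceneFusionSketch(nn.Module):
    """Hypothetical DSF-style block: aggregates local instance features
    into the global scene features via cross-attention. This is an
    illustration of the idea, not the paper's actual architecture."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, global_tokens, local_tokens):
        # global_tokens: (B, N_scene, dim) -- fused global scene features (GSI output)
        # local_tokens:  (B, N_inst, dim)  -- per-instance local features (LSI output)
        attended, _ = self.cross_attn(
            query=global_tokens, key=local_tokens, value=local_tokens
        )
        fused = self.norm1(global_tokens + attended)  # residual connection + norm
        return self.norm2(fused + self.ffn(fused))    # lightweight FFN refinement


# Example: 900 global scene tokens enriched by 40 local instance tokens.
fusion = DualSceneFusionSketch()
out = fusion(torch.randn(2, 900, 256), torch.randn(2, 40, 256))
print(out.shape)  # torch.Size([2, 900, 256])

Cross-attention with the global scene as the query is one natural way to realize "aggregating the local scene to the global scene," since it preserves the global token layout while injecting instance-level detail; the actual DSF module may differ.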
Related Material
[pdf] [bibtex]
@InProceedings{Cai_2025_WACV,
    author    = {Cai, Haojie and Yin, Dongfu and Yu, Fei Richard and Xiong, Siting},
    title     = {DSTR: Dual Scenes Transformer for Cross-Modal Fusion in 3D Object Detection},
    booktitle = {Proceedings of the Winter Conference on Applications of Computer Vision (WACV)},
    month     = {February},
    year      = {2025},
    pages     = {3064-3073}
}