PMAFusion: Projection-Based Multi-Modal Alignment for 3D Semantic Occupancy Prediction

Shiyao Li, Wenming Yang, Qingmin Liao; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 3627-3634

Abstract


3D semantic occupancy prediction offers holistic scene understanding with both spatial structure and semantic analysis. Current research in this field primarily focuses on single-modal inputs, relying on either images or point cloud data; the potential of combining the complementary attributes of images and point clouds has not been fully explored. Previous methods transform image features into 3D space for direct concatenation using monocular depth estimation, which may introduce noise due to inaccurate depth predictions and can incur substantial memory usage from explicitly constructing dense image feature volumes. To this end, we propose PMAFusion, an effective fusion module based on accurate multi-modal alignment. We first project the point cloud onto the images using camera parameters, thereby aligning each voxel with its associated pixels. A cross-attention module is then used to adaptively fuse voxel-pixel features for an improved representation. To handle empty voxels, which naturally lack aligned pixels, we generate reference points through uniform sampling to supplement the missing spatial information. With PMAFusion, we achieve the best results on the nuScenes-Occupancy dataset and conduct thorough experiments to evaluate the effectiveness and efficiency of our proposed method.
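
The abstract only outlines the alignment pipeline. As a rough illustration of the described projection and cross-attention fusion, a minimal sketch might look like the following, assuming a pinhole camera model and PyTorch tensors; the names `project_points`, `VoxelPixelCrossAttention`, and `uniform_reference_points` are hypothetical and not taken from the authors' code.

```python
# Hypothetical sketch (not the authors' implementation): project 3D voxel
# centers into the image with the camera parameters, then fuse each voxel's
# features with its aligned pixel features via cross-attention.
import torch
import torch.nn as nn

def project_points(xyz, intrinsics, extrinsics):
    """Project N 3D points (world frame) to pixel coordinates.

    xyz:        (N, 3) point coordinates
    intrinsics: (3, 3) camera intrinsic matrix K
    extrinsics: (4, 4) world-to-camera transform
    Returns (N, 2) pixel coordinates and an (N,) validity mask.
    """
    ones = torch.ones_like(xyz[:, :1])
    cam = (extrinsics @ torch.cat([xyz, ones], dim=1).T).T[:, :3]   # (N, 3)
    valid = cam[:, 2] > 1e-5                       # points in front of the camera
    uvw = (intrinsics @ cam.T).T                   # (N, 3)
    uv = uvw[:, :2] / uvw[:, 2:3].clamp(min=1e-5)  # perspective division
    return uv, valid

class VoxelPixelCrossAttention(nn.Module):
    """Adaptively fuse per-voxel queries with their projected pixel features."""
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, voxel_feats, pixel_feats):
        # voxel_feats: (V, 1, C) queries; pixel_feats: (V, P, C) keys/values,
        # the P pixels each voxel projects onto across cameras.
        fused, _ = self.attn(voxel_feats, pixel_feats, pixel_feats)
        return fused

def uniform_reference_points(voxel_center, voxel_size, num_samples=4):
    """For empty voxels with no LiDAR points, uniformly sample reference
    points inside the voxel to obtain pixel alignments (illustrative only)."""
    offsets = (torch.rand(num_samples, 3) - 0.5) * voxel_size
    return voxel_center.unsqueeze(0) + offsets
```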

Related Material


[pdf]
[bibtex]
@InProceedings{Li_2024_CVPR,
    author    = {Li, Shiyao and Yang, Wenming and Liao, Qingmin},
    title     = {PMAFusion: Projection-Based Multi-Modal Alignment for 3D Semantic Occupancy Prediction},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {3627-3634}
}