TransFusion: Multi-Modal Fusion Network for Semantic Segmentation
The complementary properties of 2D color images and 3D point clouds can potentially improve semantic segmentation compared to using uni-modal data. Multi-modal data fusion is, however, challenging due to the heterogeneity and dimensionality of the data, the difficulty of aligning different modalities to the same reference frame, and the presence of modality-specific bias. To address this, we propose TransFusion, a new model for semantic segmentation that fuses images directly with point clouds, without the need for lossy pre-processing of the point clouds. TransFusion outperforms a baseline FCN model that uses images with depth maps, improving mIoU by 4% and 2% on the Vaihingen and Potsdam datasets, respectively. We demonstrate that our proposed model adequately learns spatial and structural information, resulting in better inference.
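To make the idea of fusing images directly with raw point clouds concrete, the following is a minimal PyTorch sketch of one plausible cross-attention fusion scheme, in which raw 3D points query pixel features. This is an illustration only, not the paper's architecture; the module names, feature dimensions, and class count are hypothetical assumptions.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Hypothetical sketch: fuse per-pixel image features with per-point
    features via cross-attention, so 3D points attend to 2D image context
    without converting the point cloud to a lossy intermediate (e.g. a
    depth map). All sizes here are illustrative assumptions."""

    def __init__(self, dim: int = 64, heads: int = 4, num_classes: int = 6):
        super().__init__()
        self.img_proj = nn.Linear(3, dim)   # RGB pixel -> feature token
        self.pt_proj = nn.Linear(3, dim)    # raw XYZ point -> feature token
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)  # per-point class logits

    def forward(self, pixels: torch.Tensor, points: torch.Tensor):
        # pixels: (B, Np, 3) flattened RGB values; points: (B, Nq, 3) raw XYZ
        img_tokens = self.img_proj(pixels)  # keys/values from the image
        pt_tokens = self.pt_proj(points)    # queries from the point cloud
        fused, _ = self.attn(pt_tokens, img_tokens, img_tokens)
        return self.head(fused)             # (B, Nq, num_classes)

model = CrossModalFusion()
logits = model(torch.rand(2, 1024, 3), torch.rand(2, 512, 3))
print(logits.shape)  # torch.Size([2, 512, 6])
```

Because the points enter the network as raw coordinates, no projection to a depth map is required; the attention weights learn the 2D-3D correspondence implicitly.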