Multimodal Object Detection by Channel Switching and Spatial Attention
Multimodal object detection has attracted great attention in recent years, since information from different modalities is complementary and can effectively improve the accuracy and stability of a detection model. However, compared to processing inputs from a single modality, fusing information from multiple modalities can significantly increase the computational complexity of the model and thus impair its efficiency. Therefore, the multimodal fusion module must be carefully designed to enhance the detector's performance while keeping its computational cost low. In this paper, we propose a novel lightweight fusion module that efficiently fuses inputs from different modalities using channel switching and spatial attention (CSSA). The effectiveness and generalizability of the module are evaluated on two public multimodal datasets, LLVIP and FLIR, both of which comprise paired infrared (IR) and visible (RGB) images. The experiments demonstrate that the proposed CSSA module substantially improves the accuracy of multimodal object detection without consuming excessive computing resources.
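To make the idea concrete, the following is a minimal PyTorch sketch of a CSSA-style fusion block, not the authors' exact implementation. It assumes channel switching replaces the least important channels of one modality (ranked here by the magnitude of BatchNorm scaling factors, an assumed criterion) with the corresponding channels of the other modality, followed by a standard spatial attention (average/max pooling over channels, then a convolution and sigmoid). The class names `CSSAFusion` and `SpatialAttention`, and the `switch_ratio` parameter, are hypothetical.

```python
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """Common spatial attention: pool over channels, then conv + sigmoid."""

    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = x.mean(dim=1, keepdim=True)      # (B, 1, H, W)
        mx, _ = x.max(dim=1, keepdim=True)     # (B, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * attn


class CSSAFusion(nn.Module):
    """Illustrative channel switching + spatial attention fusion block."""

    def __init__(self, channels: int, switch_ratio: float = 0.5):
        super().__init__()
        self.bn_rgb = nn.BatchNorm2d(channels)
        self.bn_ir = nn.BatchNorm2d(channels)
        self.k = max(1, int(channels * switch_ratio))  # channels to switch
        self.sa_rgb = SpatialAttention()
        self.sa_ir = SpatialAttention()

    def _switch(self, a: torch.Tensor, b: torch.Tensor,
                gamma_a: torch.Tensor) -> torch.Tensor:
        # Replace the k channels of `a` with the smallest |gamma| (assumed
        # importance measure) by the corresponding channels of `b`.
        idx = torch.argsort(gamma_a.abs())[: self.k]
        out = a.clone()
        out[:, idx] = b[:, idx]
        return out

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
        rgb_n, ir_n = self.bn_rgb(rgb), self.bn_ir(ir)
        rgb_sw = self._switch(rgb_n, ir_n, self.bn_rgb.weight)
        ir_sw = self._switch(ir_n, rgb_n, self.bn_ir.weight)
        # Refine each switched stream spatially, then merge by averaging.
        return 0.5 * (self.sa_rgb(rgb_sw) + self.sa_ir(ir_sw))


# Example usage on paired RGB/IR feature maps of identical shape:
fuse = CSSAFusion(channels=256)
rgb_feat = torch.randn(2, 256, 64, 64)
ir_feat = torch.randn(2, 256, 64, 64)
fused = fuse(rgb_feat, ir_feat)  # (2, 256, 64, 64)
```

Note that the block adds no convolutions beyond the single small kernel in each attention branch, which is consistent with the lightweight, low-overhead design goal stated above.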