CAFF-DINO: Multi-spectral Object Detection Transformers with Cross-attention Features Fusion

Kevin Helvig, Baptiste Abeloos, Pauline Trouvé-Peloux; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 3037-3046


Object detection on images can benefit from coupling multiple spectra, each of which provides specific useful features. However, building an efficient architecture that couples the different modalities is a complex task. Transformers, owing to their ability to extract meaningful correlations between different regions of the inputs, appear to be a promising way to perform feature fusion across spectra. This work presents a multi-spectral object detection architecture based on cross-attention features fusion (CAFF) combined with a transformer-based detector (DINO). We demonstrate the performance of the proposed approach in object detection compared with state-of-the-art approaches on infrared-visible multi-spectral datasets. Moreover, the robustness to systematic misalignment between image pairs is studied. The proposed approach is generic and applies to any mono-spectrum transformer-based detector. The model developed in this study will be available in a dedicated GitHub repository.
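The abstract does not specify the internals of the CAFF module, so the following is only a minimal sketch of the general idea of cross-attention fusion between two spectra, not the authors' implementation. It assumes that each spectrum has already been encoded into the same number of tokens of equal dimension, omits learned query/key/value projections and multiple heads, and uses hypothetical names (`cross_attention`, `fuse_spectra`) chosen for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention where queries come from one
    spectrum and keys/values come from the other.

    Assumption: no learned projections; a real module would apply
    linear maps to queries, keys, and values first.
    """
    scale = 1.0 / math.sqrt(len(queries[0]))
    width = len(values[0])
    attended = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)]
        )
    return attended


def fuse_spectra(visible_tokens, infrared_tokens):
    """Hypothetical symmetric fusion: each spectrum attends to the
    other, the result is added residually, and the two enhanced
    token sets are concatenated along the feature axis.
    """
    vis_ctx = cross_attention(visible_tokens, infrared_tokens, infrared_tokens)
    ir_ctx = cross_attention(infrared_tokens, visible_tokens, visible_tokens)
    vis_enh = [[a + b for a, b in zip(t, c)] for t, c in zip(visible_tokens, vis_ctx)]
    ir_enh = [[a + b for a, b in zip(t, c)] for t, c in zip(infrared_tokens, ir_ctx)]
    return [v + i for v, i in zip(vis_enh, ir_enh)]


# Toy example: three tokens per spectrum, two features per token.
visible = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
infrared = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
fused = fuse_spectra(visible, infrared)
```

In this sketch the fused tokens have twice the feature width of the inputs. In a detector pipeline they would then be passed to the transformer encoder in place of single-spectrum features; how the paper performs that hand-off is not stated in the abstract.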

Related Material

@InProceedings{Helvig_2024_CVPR,
    author    = {Helvig, Kevin and Abeloos, Baptiste and Trouv\'e-Peloux, Pauline},
    title     = {CAFF-DINO: Multi-spectral Object Detection Transformers with Cross-attention Features Fusion},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {3037-3046}
}