Rethinking Multi-modal Object Detection from the Perspective of Mono-Modality Feature Learning

Zhao, Tianyi; Liu, Boyang; Gao, Yanglei; Sun, Yiming; Yuan, Maoxun; Wei, Xingxing

Tianyi Zhao, Boyang Liu, Yanglei Gao, Yiming Sun, Maoxun Yuan, Xingxing Wei; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 6364-6373

Abstract

Multi-Modal Object Detection (MMOD) has been widely applied thanks to its stronger adaptability to complex environments. Extensive research is dedicated to RGB-IR object detection, primarily focusing on how to integrate complementary features from the RGB and IR modalities. However, these works neglect the problem of insufficient mono-modality learning, which arises from decreased feature extraction capability in multi-modal joint learning. This leads to a prevalent but unreasonable phenomenon, Fusion Degradation, which hinders the performance improvement of MMOD models. Motivated by this, we introduce linear probing evaluation to multi-modal detectors and rethink the multi-modal object detection task from the mono-modality learning perspective. We then construct a novel framework called M2D-LIF, which consists of the Mono-Modality Distillation (M2D) method and the Local Illumination-aware Fusion (LIF) module. The M2D-LIF framework facilitates sufficient mono-modality learning during multi-modal joint training and explores a lightweight yet effective feature fusion scheme to achieve superior object detection performance. Extensive experiments on three MMOD datasets demonstrate that M2D-LIF effectively mitigates the Fusion Degradation phenomenon and outperforms previous SOTA detectors. The code is available at https://github.com/Zhao-Tian-yi/M2D-LIF.

Related Material

[bibtex]

@InProceedings{Zhao_2025_ICCV,
    author    = {Zhao, Tianyi and Liu, Boyang and Gao, Yanglei and Sun, Yiming and Yuan, Maoxun and Wei, Xingxing},
    title     = {Rethinking Multi-modal Object Detection from the Perspective of Mono-Modality Feature Learning},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {6364--6373}
}