SMSTracker: Tri-path Score Mask Sigma Fusion for Multi-Modal Tracking

Sixian Chan, Zedong Li, Wenhao Li, Shijian Lu, Chunhua Shen, Xiaoqin Zhang; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 4766-4775

Abstract


Multi-modal object tracking has emerged as a significant research focus in computer vision because it remains robust under challenging conditions such as exposure variations, blur, and occlusions. Existing studies integrate supplementary modal information into pre-trained RGB trackers through visual prompt mechanisms, but this approach has a critical limitation: it inherently treats RGB as the dominant modality and therefore underutilizes the complementary information in the other modalities. To address this limitation, we present SMSTracker, a tri-path score mask sigma fusion framework for multi-modal tracking built on three key components. First, we design a tri-path Score Mask Fusion (SMF) module that evaluates and quantifies the reliability of each modality, allowing complementary features between modalities to be exploited effectively. Second, we introduce a Sigma Interaction (SGI) module that fuses modal features across the three branches. Third, we propose a Drop Key Fine-tuning (DKF) strategy to address unequal data contribution in multi-modal learning, enhancing the model's capacity to process multi-modal information comprehensively. Extensive experiments on RGB+Thermal, RGB+Depth, and RGB+Event datasets demonstrate that SMSTracker achieves significant performance improvements over existing state-of-the-art methods. Code and model are available at https://github.com/Leezed525/SMSTracker.
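
The abstract does not specify how the SMF module computes or applies its reliability scores, so the following is only a generic sketch of reliability-weighted fusion with masking, not the paper's method. The function names `softmax` and `fuse`, the `min_weight` threshold, and the use of plain feature vectors are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Turn raw per-modality scores into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(features, scores, min_weight=0.0):
    """Hypothetical reliability-weighted fusion (illustrative only).

    features:   one equal-length feature vector per modality
    scores:     one raw reliability score per modality (assumed given)
    min_weight: modalities whose weight falls below this are masked out
    """
    weights = softmax(scores)
    # Mask unreliable modalities, then renormalize the survivors.
    kept = [w if w >= min_weight else 0.0 for w in weights]
    total = sum(kept)
    if total == 0.0:
        # Everything was masked: fall back to the unmasked weights.
        kept, total = weights, sum(weights)
    kept = [w / total for w in kept]
    dim = len(features[0])
    fused = [sum(w * f[i] for w, f in zip(kept, features)) for i in range(dim)]
    return fused, kept
```

With equal scores the modalities contribute equally; when one score is much lower and `min_weight` is set, that modality is masked and the fused vector follows the remaining one. This differs from prompt-style designs in that no modality is privileged in advance.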

Related Material


@InProceedings{Chan_2025_ICCV,
    author    = {Chan, Sixian and Li, Zedong and Li, Wenhao and Lu, Shijian and Shen, Chunhua and Zhang, Xiaoqin},
    title     = {SMSTracker: Tri-path Score Mask Sigma Fusion for Multi-Modal Tracking},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {4766-4775}
}