XTrack: Multimodal Training Boosts RGB-X Video Object Trackers

Yuedong Tan, Zongwei Wu, Yuqian Fu, Zhuyun Zhou, Guolei Sun, Eduard Zamfir, Chao Ma, Danda Paudel, Luc Van Gool, Radu Timofte; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2025, pp. 5734-5744

Abstract


Multimodal sensing has proven valuable for visual tracking, as different sensor types offer unique strengths in handling specific challenging scenes where object appearance varies. While a generalist model capable of leveraging all modalities would be ideal, its development is hindered by data sparsity: in practice, typically only one modality is available at a time. It is therefore crucial to ensure that knowledge gained from multimodal sensing -- such as identifying relevant features and regions -- is effectively shared, even when certain modalities are unavailable at inference. We start from a simple assumption: similar samples across different modalities have more knowledge to share than dissimilar ones. To implement this, we employ a classifier with a weak loss, tasked with distinguishing between modalities. More specifically, if the classifier "fails" to accurately identify the modality of a given sample, this signals an opportunity for cross-modal knowledge sharing. Intuitively, knowledge transfer is facilitated whenever a sample from one modality is sufficiently close to and aligned with another. Technically, we achieve this by routing samples from one modality to the experts of the other modalities, within a mixture-of-experts framework designed for multimodal video object tracking. During inference, the expert of the respective modality is selected, and we show that it benefits from the multimodal knowledge available during training, thanks to the proposed method. Through exhaustive experiments using only paired RGB-E, RGB-D, and RGB-T data during training, we showcase the benefit of the proposed method for RGB-X tracking at inference, with an average +3% precision improvement over the current SOTA. The source code is publicly available at https://github.com/supertyd/XTrack.
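To make the routing idea concrete, below is a minimal PyTorch sketch of a weak-loss modality classifier used as a soft mixture-of-experts router. All names here (ModalityRouter, temperature, the linear experts) are hypothetical illustrations of the mechanism as described in the abstract, not the authors' implementation; see the repository above for the actual XTrack code.

    # Minimal sketch (assumed names, not the authors' code): a modality
    # classifier trained with a weak (temperature-softened) loss doubles as
    # an MoE router, so confusable samples get routed to other experts.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ModalityRouter(nn.Module):
        """Weakly supervised modality classifier used as an MoE router."""
        def __init__(self, dim: int, num_experts: int, temperature: float = 2.0):
            super().__init__()
            self.classifier = nn.Linear(dim, num_experts)  # predicts the modality
            self.experts = nn.ModuleList(
                [nn.Linear(dim, dim) for _ in range(num_experts)]
            )
            self.temperature = temperature  # softens logits -> weak predictions

        def forward(self, x: torch.Tensor, modality: torch.Tensor):
            # x: (B, dim) token features; modality: (B,) ground-truth modality ids.
            logits = self.classifier(x)
            # Weak loss: the high temperature keeps the classifier from becoming
            # perfectly discriminative, so similar cross-modal samples retain
            # mixed routing weights.
            weak_loss = F.cross_entropy(logits / self.temperature, modality)
            # Soft routing: when the classifier "fails" (flat distribution), the
            # sample is partly processed by the other modalities' experts --
            # this is where cross-modal knowledge sharing happens in training.
            weights = F.softmax(logits / self.temperature, dim=-1)         # (B, E)
            expert_out = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, dim)
            out = (weights.unsqueeze(-1) * expert_out).sum(dim=1)          # (B, dim)
            return out, weak_loss

    # Toy usage: four feature vectors from one modality, three experts (D/T/E).
    router = ModalityRouter(dim=256, num_experts=3)
    feats = torch.randn(4, 256)
    labels = torch.full((4,), 1)  # pretend all samples are thermal (id 1)
    out, loss = router(feats, labels)
    print(out.shape, loss.item())

At inference one would dispatch each sample only to the expert matching its known input modality; per the paper, that expert still benefits from the cross-modal routing it was exposed to during training.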

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Tan_2025_ICCV,
    author    = {Tan, Yuedong and Wu, Zongwei and Fu, Yuqian and Zhou, Zhuyun and Sun, Guolei and Zamfir, Eduard and Ma, Chao and Paudel, Danda and Van Gool, Luc and Timofte, Radu},
    title     = {XTrack: Multimodal Training Boosts RGB-X Video Object Trackers},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2025},
    pages     = {5734-5744}
}