MVLM: Template-Free Tracking via Vision-Language Margin Confidence and Memory-Gated Tracking

Park, Dae-Hyeon; Baek, Mina; Ha, Jeong-Hun; Park, Chan-Seop; Ganiev, Jamshidjon; Bae, Seung-Hwan

Dae-Hyeon Park, Mina Baek, Jeong-Hun Ha, Chan-Seop Park, Jamshidjon Ganiev, Seung-Hwan Bae; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 35156-35165

Abstract

We introduce a new template-free tracking paradigm based solely on natural language, capable of tracking an arbitrary object and seamlessly switching to a new target without box initialization. Our key idea is to localize an object via vision-language (VL) correlation. However, using the correlation alone is brittle under large search regions due to spatial uncertainty and ambiguous VL saliency. To resolve these issues, we propose MVLM, a memory-based vision-language margin confidence that integrates vision-language correlation, encoder prediction, and temporal memory. MVLM dynamically gates the search region--switching between compact ROI (Region of Interest) search and global re-localization--to reduce spatial uncertainty. Theoretically, we derive bounds that connect the MVLM score to tracking probability, characterizing mis-localization within ROI and ROI-exclusion probabilities. Through extensive evaluation, we validate our theorems and achieve state-of-the-art performance on several benchmarks (TNL2K, LaSOT, OTB99 and MGIT) using only language guidance.
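
The gating idea in the abstract can be illustrated with a minimal sketch. This is not the paper's formulation: the abstract does not define the MVLM score, so the code assumes a generic margin confidence (best candidate score minus the runner-up), a simple exponential moving average as the temporal memory, and a fixed threshold for the gate. All function names, the `momentum` parameter, and the `threshold` value are hypothetical.

```python
from typing import List


def margin_confidence(scores: List[float]) -> float:
    """Gap between the best and second-best candidate scores.

    Assumption: a larger gap means a less ambiguous localization.
    """
    if len(scores) < 2:
        raise ValueError("need at least two candidate scores")
    ranked = sorted(scores, reverse=True)
    return ranked[0] - ranked[1]


def update_memory(memory: float, margin: float, momentum: float = 0.8) -> float:
    """Blend the new margin into a running confidence (exponential moving average)."""
    return momentum * memory + (1.0 - momentum) * margin


def choose_search_mode(memory_conf: float, threshold: float = 0.3) -> str:
    """Keep the compact ROI while confidence is high, otherwise search globally."""
    return "roi" if memory_conf >= threshold else "global"


if __name__ == "__main__":
    memory = 0.5  # hypothetical starting confidence
    # Hypothetical per-frame candidate scores from a VL correlation map.
    frames = [[0.9, 0.2, 0.1], [0.5, 0.45, 0.4], [0.4, 0.39, 0.38]]
    for scores in frames:
        memory = update_memory(memory, margin_confidence(scores))
        print(choose_search_mode(memory), round(memory, 3))
```

In this sketch a run of ambiguous frames, where the top two scores are close, gradually lowers the running confidence until the gate falls back to global re-localization; a single ambiguous frame does not trigger the switch.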

Related Material


[bibtex]

@InProceedings{Park_2026_CVPR,
    author    = {Park, Dae-Hyeon and Baek, Mina and Ha, Jeong-Hun and Park, Chan-Seop and Ganiev, Jamshidjon and Bae, Seung-Hwan},
    title     = {MVLM: Template-Free Tracking via Vision-Language Margin Confidence and Memory-Gated Tracking},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {35156-35165}
}