Adaptive Capacity Autoregressive Visual Tracking

Tong Lin, Yifan Bai, Shiyi Liang, Ruigang Niu, Xing Wei; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 13574-13583

Abstract


We present ARTrack-AC, a new step in the autoregressive tracking paradigm that introduces adaptive capacity inference to achieve both temporal consistency and dynamic efficiency. While existing autoregressive trackers predict object states sequentially with fixed inference capacity, they fail to accommodate the fluctuating temporal difficulty of real videos. ARTrack-AC addresses this limitation by equipping the tracker with the ability to modulate its inference capacity over time. A diffusion-based difficulty estimator anticipates the stability of upcoming segments, guiding a controller to switch between an accurate (high-capacity) and an efficient (low-capacity) mode while maintaining autoregressive consistency. This system-level autoregression extends conventional sequence modeling beyond "what to predict" toward "how to predict," forming a self-regulated tracking process that aligns inference cost with temporal complexity. Despite its simplicity, ARTrack-AC achieves a state-of-the-art accuracy-speed trade-off on major benchmarks (66.7% AUC on LaSOT and 47.5% AUC on LaSOT_ext) while running 2.9x faster than its predecessor.
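The mode-switching idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the names `choose_mode`, `track`, `estimate_difficulty`, and `models`, as well as the fixed threshold, are hypothetical. The diffusion-based estimator is replaced here by an arbitrary callable, and the decision is made per frame rather than per upcoming segment, purely to keep the example short.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, w, h); an assumed box format
Predictor = Callable[[object, List[Box]], Box]


def choose_mode(difficulty: float, threshold: float = 0.5) -> str:
    """Hypothetical controller: use the high-capacity mode when the
    estimated difficulty reaches the threshold, otherwise the cheap one."""
    return "accurate" if difficulty >= threshold else "efficient"


def track(
    frames: Sequence[object],
    init_box: Box,
    estimate_difficulty: Callable[[object, List[Box]], float],
    models: Dict[str, Predictor],
    threshold: float = 0.5,
) -> Tuple[List[Box], List[str]]:
    """Autoregressive loop: every prediction is appended to the history
    that conditions the next step, whichever mode produced it."""
    history: List[Box] = [init_box]
    modes: List[str] = []
    for frame in frames:
        # Stand-in for the difficulty estimator (assumption: returns a score in [0, 1]).
        difficulty = estimate_difficulty(frame, history)
        mode = choose_mode(difficulty, threshold)
        # Both modes read and extend the same trajectory history.
        box = models[mode](frame, history)
        history.append(box)
        modes.append(mode)
    return history[1:], modes


if __name__ == "__main__":
    # Toy stand-ins: each "frame" is just its own difficulty score,
    # and both predictors simply repeat the last box.
    repeat_last: Predictor = lambda frame, hist: hist[-1]
    boxes, modes = track(
        frames=[0.1, 0.9, 0.4],
        init_box=(0.0, 0.0, 1.0, 1.0),
        estimate_difficulty=lambda frame, hist: float(frame),
        models={"accurate": repeat_last, "efficient": repeat_last},
    )
    print(modes)
```

Sharing one history across both modes is what keeps the sequence autoregressive in this sketch: switching capacity changes how a box is predicted, not what the next step is conditioned on.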

Related Material


[pdf]
[bibtex]
@InProceedings{Lin_2026_CVPR,
    author    = {Lin, Tong and Bai, Yifan and Liang, Shiyi and Niu, Ruigang and Wei, Xing},
    title     = {Adaptive Capacity Autoregressive Visual Tracking},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2026},
    pages     = {13574-13583}
}