Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers

Xie, Jinxia; Zhong, Bineng; Mo, Zhiyi; Zhang, Shengping; Shi, Liangtao; Song, Shuxiang; Ji, Rongrong

Jinxia Xie, Bineng Zhong, Zhiyi Mo, Shengping Zhang, Liangtao Shi, Shuxiang Song, Rongrong Ji; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 19300-19309

Abstract

The rich spatio-temporal information is crucial to capture the complicated target appearance variations in visual tracking. However most top-performing tracking algorithms rely on many hand-crafted components for spatio-temporal information aggregation. Consequently the spatio-temporal information is far away from being fully explored. To alleviate this issue we propose an adaptive tracker with spatio-temporal transformers (named AQATrack) which adopts simple autoregressive queries to effectively learn spatio-temporal information without many hand-designed components. Firstly we introduce a set of learnable and autoregressive queries to capture the instantaneous target appearance changes in a sliding window fashion. Then we design a novel attention mechanism for the interaction of existing queries to generate a new query in current frame. Finally based on the initial target template and learnt autoregressive queries a spatio-temporal information fusion module (STM) is designed for spatiotemporal formation aggregation to locate a target object. Benefiting from the STM we can effectively combine the static appearance and instantaneous changes to guide robust tracking. Extensive experiments show that our method significantly improves the tracker's performance on six popular tracking benchmarks: LaSOT LaSOText TrackingNet GOT-10k TNL2K and UAV123. Code and models will be https://github.com/orgs/GXNU-ZhongLab.

Related Material

[pdf]

[bibtex]

@InProceedings{Xie_2024_CVPR, author = {Xie, Jinxia and Zhong, Bineng and Mo, Zhiyi and Zhang, Shengping and Shi, Liangtao and Song, Shuxiang and Ji, Rongrong}, title = {Autoregressive Queries for Adaptive Tracking with Spatio-Temporal Transformers}, booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)}, month = {June}, year = {2024}, pages = {19300-19309} }