Learning Tracking Representations from Single Point Annotations

Qiangqiang Wu, Antoni B. Chan; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 2606-2615

Abstract


Existing deep trackers are typically trained on large-scale video frames with annotated bounding boxes. However, these bounding boxes are expensive and time-consuming to annotate, particularly for large-scale datasets. In this paper, we propose to learn tracking representations from single point annotations (i.e., 4.5x faster to annotate than the traditional bounding box) in a weakly supervised manner. Specifically, we propose a soft contrastive learning (SoCL) framework that incorporates a target objectness prior into end-to-end contrastive learning. Our SoCL consists of adaptive positive and negative sample generation, which is memory-efficient and effective for learning tracking representations. We apply the learned representations of SoCL to visual tracking and show that our method can 1) achieve better performance than the fully supervised baseline trained with box annotations under the same annotation time cost; 2) achieve performance comparable to the fully supervised baseline while using the same number of training frames and reducing annotation time cost by 78% and total fees by 85%; and 3) be robust to annotation noise.
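
The abstract does not spell out the loss, but the core idea (a point-derived objectness prior weighting an end-to-end contrastive objective) can be illustrated with a minimal sketch. The Gaussian prior, function names, and pair-weighting scheme below are assumptions for illustration, not the paper's actual SoCL formulation.

import torch
import torch.nn.functional as F

def point_objectness_prior(h, w, point, sigma=8.0):
    """Hypothetical objectness prior: a Gaussian centered at the
    annotated point, giving each spatial location a soft probability
    of belonging to the target."""
    ys = torch.arange(h).float().view(-1, 1)
    xs = torch.arange(w).float().view(1, -1)
    py, px = point
    d2 = (ys - py) ** 2 + (xs - px) ** 2
    return torch.exp(-d2 / (2.0 * sigma ** 2))  # (h, w), values in (0, 1]

def soft_contrastive_loss(feat_q, feat_k, obj_q, obj_k, tau=0.1):
    """Soft InfoNCE over dense feature maps from two frames (assumed form).
    feat_*: (C, H, W) embeddings; obj_*: (H, W) objectness maps.
    Each cross-frame location pair is weighted by the product of its
    objectness scores, so likely-target pairs act as soft positives,
    while likely-background pairs contribute mainly to the softmax
    denominator as soft negatives."""
    q = F.normalize(feat_q.flatten(1).t(), dim=1)   # (N, C), N = H*W
    k = F.normalize(feat_k.flatten(1).t(), dim=1)   # (M, C)
    sim = q @ k.t() / tau                           # (N, M) pairwise similarities
    w_pos = obj_q.flatten().unsqueeze(1) * obj_k.flatten().unsqueeze(0)
    log_p = F.log_softmax(sim, dim=1)
    return -(w_pos * log_p).sum() / w_pos.sum().clamp_min(1e-6)

# Toy usage: random features and a single annotated point per frame.
feat_q, feat_k = torch.randn(128, 16, 16), torch.randn(128, 16, 16)
obj_q = point_objectness_prior(16, 16, (8, 8))
obj_k = point_objectness_prior(16, 16, (7, 9))
loss = soft_contrastive_loss(feat_q, feat_k, obj_q, obj_k)

Note that this sketch weights all location pairs softly; the paper's adaptive positive and negative sample generation presumably selects and weights samples in a more memory-efficient way than this dense pairwise formulation.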

Related Material


[pdf] [supp] [arXiv]
[bibtex]
@InProceedings{Wu_2024_CVPR,
    author    = {Wu, Qiangqiang and Chan, Antoni B.},
    title     = {Learning Tracking Representations from Single Point Annotations},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {2606-2615}
}