Multi-Object Tracking by Self-Supervised Learning Appearance Model
In recent years, dominant multi-object tracking (MOT) and segmentation (MOTS) methods have mainly followed the tracking-by-detection paradigm. Transformer-based end-to-end (E2E) solutions have brought new ideas to MOT and MOTS, but they have yet to reach state-of-the-art (SOTA) performance on the major MOT and MOTS benchmarks. The tracking-by-detection paradigm consists of two main modules: detection and association. Association techniques mainly rely on a combination of motion and appearance information. With recent advances in deep learning, the performance of detection and appearance models has improved rapidly. These trends led us to ask whether SOTA results can be achieved using only a high-performance detector and appearance model. Our paper explores this direction, using CBNetV2 with Swin-B as the detection model and MoCo-v2 as a self-supervised appearance model; motion information and IoU matching are removed from the association step. Our method achieves SOTA results on two mainstream MOT datasets and one MOTS dataset: BDD100K MOT, Waymo 2D Tracking, and BDD100K MOTS. It yields significant improvements of +10.7% and +33.7% on the BDD100K MOT and MOTS benchmarks, respectively. The proposed method won first place in the BDD100K Multiple Object Tracking (MOT) challenge at the CVPR 2022 Workshop on Autonomous Driving, and first place in the BDD100K Multiple Object Tracking (MOT) and Multiple Object Tracking and Segmentation (MOTS) challenges at the ECCV 2022 Workshop on Self-supervised Learning for Next-Generation Industry-level Autonomous Driving (SSLAD). We hope our simple and effective method can offer some insights to the MOT and MOTS research community.
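The core idea of the abstract, association driven purely by appearance similarity with motion and IoU removed, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the similarity threshold, and the use of Hungarian matching over cosine similarities are assumptions for the sake of the example.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def associate_by_appearance(track_embs, det_embs, sim_thresh=0.5):
    """Match detections to tracks using appearance embeddings only.

    track_embs: (T, D) L2-normalized embeddings of existing tracks
    det_embs:   (N, D) L2-normalized embeddings of new detections
    Returns (matches, unmatched_track_ids, unmatched_det_ids).
    No motion model or IoU term is used, mirroring the appearance-only setup.
    """
    if len(track_embs) == 0 or len(det_embs) == 0:
        return [], list(range(len(track_embs))), list(range(len(det_embs)))
    # Cosine similarity reduces to a dot product for normalized embeddings.
    sim = track_embs @ det_embs.T
    # Hungarian assignment maximizing total similarity (hence the negation).
    rows, cols = linear_sum_assignment(-sim)
    matches = []
    un_t, un_d = set(range(len(track_embs))), set(range(len(det_embs)))
    for t, d in zip(rows, cols):
        if sim[t, d] >= sim_thresh:  # reject low-similarity pairs
            matches.append((t, d))
            un_t.discard(t)
            un_d.discard(d)
    return matches, sorted(un_t), sorted(un_d)


# Hypothetical usage: two tracks and two detections whose appearances are swapped.
tracks = np.array([[1.0, 0.0], [0.0, 1.0]])
dets = np.array([[0.0, 1.0], [1.0, 0.0]])
matches, unmatched_t, unmatched_d = associate_by_appearance(tracks, dets)
```

In a full tracker, the embeddings would come from the self-supervised appearance model (MoCo-v2 in the paper), and unmatched detections would start new tracks while unmatched tracks would be aged out.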