Continuous Copy-Paste for One-Stage Multi-Object Tracking and Segmentation
Current one-stage multi-object tracking and segmentation (MOTS) methods lag behind recent two-stage methods. By separating the instance segmentation stage from the tracking stage, two-stage methods can exploit non-video datasets as extra training data for instance segmentation. Moreover, when training the tracker, they can gather instances with different IDs from different frames, rather than the limited number of instances in raw consecutive frames, which allows more effective hard example mining. In this paper, we bridge this gap with a novel data augmentation strategy named continuous copy-paste (CCP). The intuition behind CCP is to fully exploit the pixel-wise annotations provided by MOTS datasets to actively increase both the number of instances and the number of unique instance IDs seen in training. Without any modification to their frameworks, current MOTS methods achieve significant performance gains when trained with CCP. Building on CCP, we propose the first effective one-stage online MOTS method, named CCPNet, which generates instance masks and tracking results in one shot. CCPNet surpasses all state-of-the-art methods by large margins (3.8% higher sMOTSA and 4.1% higher MOTSA for pedestrians on the KITTI MOTS validation set) and ranks 1st on the KITTI MOTS leaderboard. Evaluations across three datasets further demonstrate the effectiveness of both CCP and CCPNet. Our code is publicly available at: https://github.com/detectRecog/CCP.
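
To make the copy-paste idea concrete, the following is a minimal sketch, not the implementation in the released repository. It assumes a donor instance has already been cropped from some other sequence as per-frame RGB patches and pixel masks, and pastes it into consecutive frames of a target clip under a single new instance ID. The function name `paste_track`, its arguments, and the ID-map representation (0 for background) are all illustrative assumptions; the abstract does not specify how donors are selected, placed, or blended.

```python
import numpy as np

def paste_track(frames, id_maps, donor_patches, donor_masks, top_left, new_id):
    """Paste one donor instance track into a clip, frame by frame.

    Hypothetical helper for illustration only.
    frames:        (T, H, W, 3) uint8 clip
    id_maps:       (T, H, W) int instance-ID maps, 0 = background
    donor_patches: list of T arrays (h, w, 3), RGB crops of the donor
    donor_masks:   list of T bool arrays (h, w), donor pixel masks
    top_left:      list of T (row, col) paste positions (patch must fit in frame)
    new_id:        ID given to the pasted instance in every frame
    """
    out_frames = frames.copy()
    out_ids = id_maps.copy()
    for t, (patch, mask, (r, c)) in enumerate(zip(donor_patches, donor_masks, top_left)):
        h, w = mask.shape
        # Basic slices are views, so writing to them updates the output arrays.
        region_rgb = out_frames[t, r:r + h, c:c + w]
        region_ids = out_ids[t, r:r + h, c:c + w]
        region_rgb[mask] = patch[mask]   # copy only the donor's own pixels
        region_ids[mask] = new_id        # pasted pixels occlude what was there
    return out_frames, out_ids

# Toy clip: 3 black frames with one existing instance (ID 1) in the corner.
T, H, W = 3, 8, 8
frames = np.zeros((T, H, W, 3), dtype=np.uint8)
id_maps = np.zeros((T, H, W), dtype=np.int64)
id_maps[:, 0:2, 0:2] = 1

# Donor: a white, L-shaped 2x2 patch that drifts one pixel right per frame.
patch = np.full((2, 2, 3), 255, dtype=np.uint8)
mask = np.array([[True, False], [True, True]])
positions = [(4, 1), (4, 2), (4, 3)]

new_id = int(id_maps.max()) + 1  # an ID not yet used in the clip
aug_frames, aug_ids = paste_track(
    frames, id_maps, [patch] * T, [mask] * T, positions, new_id
)
```

Because the same ID is written in every frame, the pasted instance acts as an additional track rather than as unrelated per-frame detections, which is how the augmented clip gains both an extra instance and an extra unique ID.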