Mining Better Samples for Contrastive Learning of Temporal Correspondence

Sangryul Jeon, Dongbo Min, Seungryong Kim, Kwanghoon Sohn; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 1034-1044

Abstract


We present a novel framework for contrastive learning of pixel-level representations using only unlabeled video. Without any ground-truth annotation, our method collects well-defined positive correspondences by measuring their confidence, and well-defined negative ones by appropriately adjusting their hardness during training. This suppresses the adverse impact of ambiguous matches and prevents trivial solutions caused by negative samples that are too hard or too easy. To accomplish this, we incorporate three criteria, ranging from pixel-level to video-level matching confidence, into a bottom-up pipeline, and schedule a curriculum that adapts the hardness of negative samples to the current representation power during training. With the proposed method, state-of-the-art performance is attained over the latest approaches on several video label propagation tasks.
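The mining strategy described above can be illustrated with a minimal NumPy sketch: positives are taken as cycle-consistent (mutual nearest-neighbour) matches above a confidence threshold, and negatives are drawn from a similarity band whose bounds stand in for the curriculum's hardness schedule. This is not the authors' implementation; the function name, thresholds, and the single-band hardness rule are all illustrative assumptions.

```python
import numpy as np

def contrastive_loss_with_mining(feat_a, feat_b, pos_thresh=0.7,
                                 hard_lo=0.2, hard_hi=0.6, tau=0.07):
    """Illustrative pixel-level contrastive loss with sample mining.

    feat_a, feat_b: (N, C) L2-normalized per-pixel features from two frames.
    Positives: mutual nearest neighbours whose similarity exceeds pos_thresh
    (a stand-in for the paper's pixel-to-video-level confidence criteria).
    Negatives: pixels whose similarity to the anchor falls inside the
    [hard_lo, hard_hi) band; the curriculum would move this band as the
    representation improves.
    """
    sim = feat_a @ feat_b.T                      # (N, N) cosine similarities
    nn_ab = sim.argmax(axis=1)                   # frame a -> b nearest neighbours
    nn_ba = sim.argmax(axis=0)                   # frame b -> a nearest neighbours
    anchors = np.arange(len(feat_a))
    mutual = nn_ba[nn_ab] == anchors             # cycle-consistent matches
    confident = sim[anchors, nn_ab] > pos_thresh
    keep = mutual & confident                    # mined positive pairs

    losses = []
    for i in anchors[keep]:
        pos = sim[i, nn_ab[i]]
        band = (sim[i] >= hard_lo) & (sim[i] < hard_hi)
        band[nn_ab[i]] = False                   # never use the positive as a negative
        negs = sim[i, band]
        if negs.size == 0:
            continue
        logits = np.concatenate([[pos], negs]) / tau
        logits -= logits.max()                   # numerical stability
        # InfoNCE: -log( exp(pos) / sum(exp(all)) )
        losses.append(-logits[0] + np.log(np.exp(logits).sum()))
    return float(np.mean(losses)) if losses else 0.0
```

With identical features in both frames every match is perfectly confident and no pixel falls in the negative band, so the loss is zero; moderately similar distractors raise it.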

Related Material


[pdf] [supp]
[bibtex]
@InProceedings{Jeon_2021_CVPR,
    author    = {Jeon, Sangryul and Min, Dongbo and Kim, Seungryong and Sohn, Kwanghoon},
    title     = {Mining Better Samples for Contrastive Learning of Temporal Correspondence},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2021},
    pages     = {1034-1044}
}