Self-Supervised Learning of Semantic Correspondence Using Web Videos

Kwon, Donghyeon; Cho, Minsu; Kwak, Suha

Donghyeon Kwon, Minsu Cho, Suha Kwak; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2024, pp. 2142-2152

Abstract

Existing datasets for semantic correspondence are often limited in both the amount of labeled data and the diversity of labeled keypoints, owing to the tremendous cost of manual correspondence labeling. To address this issue, we propose the first self-supervised learning framework that utilizes a large number of web videos collected and annotated fully automatically. Our main motivation is that smooth changes between consecutive video frames make it possible to build accurate space-time correspondences with no human intervention. Hence, we establish space-time correspondences within each web video and leverage them to derive pseudo correspondence labels between two distant frames of the video. In addition, we present a dedicated training strategy that facilitates stable training using web videos with such pseudo labels. Our experiments on public benchmarks demonstrate that the proposed method surpasses existing self-supervised learning models and that using our self-supervised learning as pretraining for supervised learning improves performance substantially. Our codebase for web video crawling and pseudo label generation will be released publicly to promote future research.
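The idea of deriving labels between distant frames from frame-to-frame correspondences can be illustrated with a minimal sketch. This is not the authors' pipeline: it assumes that a dense displacement field between each pair of consecutive frames is already available (for example, from some off-the-shelf flow estimator), uses nearest-neighbour lookup, and simply drops points that leave the image. All names (`chain_flows`, `constant_flow`, `Flow`) are hypothetical and chosen for illustration only.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]               # (x, y) in pixel coordinates
Flow = List[List[Tuple[float, float]]]    # flow[y][x] = (dx, dy), frame t -> t+1


def chain_flows(points: List[Point], flows: List[Flow]) -> Dict[int, Point]:
    """Track points from the first frame through consecutive flow fields.

    Returns {index of starting point: position in the last frame}. A point is
    dropped as soon as its flow lookup falls outside the image.
    """
    tracks: Dict[int, Point] = dict(enumerate(points))
    for flow in flows:
        height, width = len(flow), len(flow[0])
        survivors: Dict[int, Point] = {}
        for idx, (x, y) in tracks.items():
            col, row = int(round(x)), int(round(y))  # nearest-neighbour lookup
            if not (0 <= col < width and 0 <= row < height):
                continue  # point left the frame: no pseudo label for it
            dx, dy = flow[row][col]
            survivors[idx] = (x + dx, y + dy)
        tracks = survivors
    return tracks


def constant_flow(height: int, width: int, dx: float, dy: float) -> Flow:
    """Toy flow field (assumption for the demo): every pixel moves by (dx, dy)."""
    return [[(dx, dy)] * width for _ in range(height)]


# Three consecutive frame pairs of an 8x8 video, each shifting content 1 px right.
flows = [constant_flow(8, 8, 1.0, 0.0) for _ in range(3)]
starts = [(1.0, 2.0), (6.0, 4.0)]

# Pseudo correspondences between frame 0 and frame 3.
pseudo_labels = chain_flows(starts, flows)
print(pseudo_labels)  # {0: (4.0, 2.0)}; the second point drifts out of the image
```

In this toy setting, each surviving entry pairs a location in the first frame with its tracked location in a distant frame, which is the kind of pair a pseudo correspondence label describes. A practical system would additionally need to filter unreliable tracks, which this sketch does not attempt.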

Related Material

[bibtex]

@InProceedings{Kwon_2024_WACV,
    author    = {Kwon, Donghyeon and Cho, Minsu and Kwak, Suha},
    title     = {Self-Supervised Learning of Semantic Correspondence Using Web Videos},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2024},
    pages     = {2142-2152}
}