ASCNet: Self-Supervised Video Representation Learning With Appearance-Speed Consistency

Huang, Deng; Wu, Wenhao; Hu, Weiwen; Liu, Xu; He, Dongliang; Wu, Zhihua; Wu, Xiangmiao; Tan, Mingkui; Ding, Errui

Deng Huang, Wenhao Wu, Weiwen Hu, Xu Liu, Dongliang He, Zhihua Wu, Xiangmiao Wu, Mingkui Tan, Errui Ding; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 8096-8105

Abstract

We study self-supervised video representation learning, which is a challenging task due to 1) insufficient labels for supervision and 2) unstructured and noisy visual information. Existing methods mainly use a contrastive loss with video clips as the instances and learn visual representations by discriminating instances from each other, but they need a careful treatment of negative pairs, relying on large batch sizes, memory banks, extra modalities, or customized mining strategies, which inevitably includes noisy data. In this paper, we observe that the consistency between positive samples is the key to learning robust video representations. Specifically, we propose two tasks to learn appearance and speed consistency, respectively. The appearance consistency task aims to maximize the similarity between two clips of the same video with different playback speeds. The speed consistency task aims to maximize the similarity between two clips with the same playback speed but different appearance information. We show that jointly optimizing the two tasks consistently improves performance on downstream tasks, e.g., action recognition and video retrieval. Remarkably, for action recognition on the UCF-101 dataset, we achieve 90.8% accuracy without using any extra modalities or negative pairs for unsupervised pre-training, which outperforms the ImageNet supervised pre-trained model. Code and models will be available.
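The two positive-pair constructions described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the stride-based notion of playback speed, the use of negative cosine similarity as the objective, and the choice to draw the speed-consistency pair from two different videos are all assumptions made for illustration.

```python
import math
import random


def sample_clip_indices(num_frames, clip_len, speed, rng):
    """Return clip_len frame indices, taking every `speed`-th frame
    from a random start (hypothetical helper; speed = frame stride)."""
    span = (clip_len - 1) * speed + 1  # frames covered by the clip
    if span > num_frames:
        raise ValueError("video too short for this clip length and speed")
    start = rng.randrange(num_frames - span + 1)
    return [start + i * speed for i in range(clip_len)]


def cosine_similarity(u, v):
    """Cosine similarity between two feature vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def consistency_loss(u, v):
    """Negative cosine similarity (assumed objective): minimizing it
    maximizes the similarity of a positive pair, with no negatives."""
    return -cosine_similarity(u, v)


rng = random.Random(0)

# Appearance-consistency pair: same video, different playback speeds.
slow = sample_clip_indices(num_frames=300, clip_len=16, speed=1, rng=rng)
fast = sample_clip_indices(num_frames=300, clip_len=16, speed=4, rng=rng)

# Speed-consistency pair: same playback speed, different appearance
# (assumed here to mean clips from two different videos).
clip_a = sample_clip_indices(num_frames=300, clip_len=16, speed=2, rng=rng)
clip_b = sample_clip_indices(num_frames=250, clip_len=16, speed=2, rng=rng)
```

In a training loop, each index list would select frames that are passed through a video encoder, and `consistency_loss` would be applied to the resulting feature vectors of each pair; the encoder and any projection heads are outside the scope of this sketch.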

Related Material

@InProceedings{Huang_2021_ICCV,
    author    = {Huang, Deng and Wu, Wenhao and Hu, Weiwen and Liu, Xu and He, Dongliang and Wu, Zhihua and Wu, Xiangmiao and Tan, Mingkui and Ding, Errui},
    title     = {ASCNet: Self-Supervised Video Representation Learning With Appearance-Speed Consistency},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2021},
    pages     = {8096-8105}
}