Temporal Contrastive Pretraining for Video Action Recognition

Guillaume LORRE, Jaonary Rabarisoa, Astrid Orcesi, Samia Ainouz, Stephane Canu; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2020, pp. 662-670


In this paper, we propose a self-supervised method for video representation learning based on Contrastive Predictive Coding (CPC) [27]. CPC has previously been used to learn representations for other signals (audio, text and images). It combines autoregressive modeling with contrastive estimation to learn long-term relations in the raw signal while remaining robust to local noise. Our self-supervised task consists of predicting the latent representations of future segments of the video. In contrast to generative models, predicting directly in the feature space is easier and avoids the uncertainty problems that affect long-term predictions. Using CPC to learn video representations nevertheless remains challenging because of the structure and high dimensionality of the signal. We demonstrate experimentally that the representations learned by the network are useful for action recognition. We test the method with different input types (optical flow, image differences and raw images) on two datasets (UCF-101 and HMDB51), and it gives consistent results across these modalities. Finally, we show the utility of our pre-training method by achieving competitive action recognition results with few labeled data.
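
As an illustration of the contrastive estimation step described above, the following is a minimal sketch of an InfoNCE-style loss in plain Python. It is not the authors' implementation. The function names (`dot`, `info_nce_loss`), the plain dot-product score, and the choice of using the other segments in the batch as negatives are all assumptions made for this example. Each predicted future latent is scored against every actual future latent, and the loss is low when the matching pair scores highest.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors (hypothetical helper)."""
    return sum(a * b for a, b in zip(u, v))


def info_nce_loss(predicted, actual):
    """Mean InfoNCE-style loss over a batch of latent vectors.

    predicted[i] is the predicted latent of a future segment and
    actual[i] is the encoded latent of that same segment (the positive).
    All other rows of `actual` serve as negatives (an assumption here).
    """
    if not predicted or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and the same length")

    total = 0.0
    for i, p in enumerate(predicted):
        # Score the prediction against every candidate latent.
        scores = [dot(p, z) for z in actual]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability of picking the true future segment.
        total += log_norm - scores[i]
    return total / len(predicted)


if __name__ == "__main__":
    actual = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[5.0, 0.0], [0.0, 5.0]]     # predictions point at the right latents
    misaligned = [[0.0, 5.0], [5.0, 0.0]]  # predictions point at the wrong latents
    print(info_nce_loss(aligned, actual) < info_nce_loss(misaligned, actual))
```

In a full pipeline the predictions would come from an autoregressive model over past segment latents; here they are given directly so that only the contrastive objective is shown.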

Related Material

author = {LORRE, Guillaume and Rabarisoa, Jaonary and Orcesi, Astrid and Ainouz, Samia and Canu, Stephane},
title = {Temporal Contrastive Pretraining for Video Action Recognition},
booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
month = {March},
year = {2020}