CoCon: Cooperative-Contrastive Learning

Nishant Rai, Ehsan Adeli, Kuan-Hui Lee, Adrien Gaidon, Juan Carlos Niebles; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2021, pp. 3384-3393


Labeling videos at scale is impractical. Consequently, self-supervised visual representation learning is key for efficient video analysis. Recent success in learning image representations suggests that contrastive learning is a promising framework to tackle this challenge. However, when applied to real-world videos, contrastive learning may unknowingly lead to the separation of instances that contain semantically similar events. In our work, we introduce a cooperative variant of contrastive learning to address this issue. We use data-driven sampling to leverage implicit relationships between multiple input video views, whether observed (e.g., RGB) or inferred (e.g., flow, segmentation masks, poses). We experimentally evaluate our representations on the downstream task of action recognition. Our method sets a new state of the art on standard benchmarks (UCF101, HMDB51, Kinetics400). Furthermore, qualitative experiments illustrate that our models can capture higher-order class relationships. The code is available at

Related Material

@InProceedings{Rai_2021_CVPR,
    author    = {Rai, Nishant and Adeli, Ehsan and Lee, Kuan-Hui and Gaidon, Adrien and Niebles, Juan Carlos},
    title     = {CoCon: Cooperative-Contrastive Learning},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2021},
    pages     = {3384-3393}
}