- [pdf] [supp] [code]
TCVM: Temporal Contrasting Video Montage Framework for Self-supervised Video Representation Learning
Extracting appropriate temporal differences and ignoring irrelevant backgrounds are two important perspectives on preserving sufficient motion information in video representation. In this paper, we propose a unified contrastive learning framework called Temporal Contrasting Video Montage (TCVM) to learn action-specific motion patterns, which can be implemented in a plug-and-play way. On the one hand, Temporal Contrasting (TC) module is designed to guarantee appropriate temporal difference between frames. It utilizes high-level feature space to capture raveled temporal information. On the other hand, Video Montage (VM) module is devised for alleviating the effect from video background. It demonstrates similar temporal motion variances in different positive samples by implicitly mixing up the backgrounds of different videos. Experimental results show that our TCVM reaches promising performances on both large action recognition dataset (i.e. Something-Somethingv2) and small datasets (i.e. UCF101 and HMDB51).