TCVM: Temporal Contrasting Video Montage Framework for Self-supervised Video Representation Learning

Fengrui Tian, Jiawei Fan, Xie Yu, Shaoyi Du, Meina Song, Yu Zhao; Proceedings of the Asian Conference on Computer Vision (ACCV), 2022, pp. 1539-1555

Abstract


Extracting appropriate temporal differences and ignoring irrelevant backgrounds are two important perspectives for preserving sufficient motion information in video representations. In this paper, we propose a unified contrastive learning framework called Temporal Contrasting Video Montage (TCVM) to learn action-specific motion patterns; it can be implemented in a plug-and-play way. On the one hand, the Temporal Contrasting (TC) module is designed to guarantee appropriate temporal differences between frames, exploiting the high-level feature space to capture entangled temporal information. On the other hand, the Video Montage (VM) module is devised to alleviate the influence of video backgrounds: it yields positive samples with similar temporal motion variances by implicitly mixing up the backgrounds of different videos. Experimental results show that TCVM achieves promising performance on both a large action recognition dataset (i.e., Something-Something V2) and smaller datasets (i.e., UCF101 and HMDB51).
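
To make the two modules concrete, below is a minimal PyTorch-style sketch of the ideas as the abstract describes them, not the authors' released implementation (see the linked code). The blending weight lam, the encoder producing per-frame features, and the exact frame-difference contrast are illustrative assumptions.

import torch
import torch.nn.functional as F

def video_montage(clip_a, clip_b, lam=0.5):
    # Illustrative montage (assumed): blend two clips so positive views share
    # the same motion while their backgrounds are implicitly mixed up.
    # clip_*: (B, C, T, H, W). TCVM's actual montage operation may differ.
    return lam * clip_a + (1 - lam) * clip_b

def temporal_contrast_loss(feats_q, feats_k, tau=0.07):
    # feats_*: (B, T, D) per-frame features from a high-level feature space.
    # Contrast the temporal differences of consecutive frames across two views.
    dq = F.normalize(feats_q[:, 1:] - feats_q[:, :-1], dim=-1)  # (B, T-1, D)
    dk = F.normalize(feats_k[:, 1:] - feats_k[:, :-1], dim=-1)
    dq = dq.flatten(0, 1)                                       # (B*(T-1), D)
    dk = dk.flatten(0, 1)
    logits = dq @ dk.t() / tau                                  # all-pairs similarities
    labels = torch.arange(logits.size(0), device=logits.device) # matching diffs are positives
    return F.cross_entropy(logits, labels)

# Usage sketch, assuming a shared per-frame encoder f (hypothetical):
#   view1 = video_montage(clip_a, clip_b)
#   view2 = video_montage(clip_a, clip_c)
#   loss = temporal_contrast_loss(f(view1), f(view2))

Here each temporal difference in one montaged view is treated as the positive for the matching difference in the other view, an InfoNCE-style formulation; TCVM's actual loss may differ in detail.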

Related Material


[bibtex]
@InProceedings{Tian_2022_ACCV,
    author    = {Tian, Fengrui and Fan, Jiawei and Yu, Xie and Du, Shaoyi and Song, Meina and Zhao, Yu},
    title     = {TCVM: Temporal Contrasting Video Montage Framework for Self-supervised Video Representation Learning},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {December},
    year      = {2022},
    pages     = {1539-1555}
}