Hierarchical Self-Supervised Representation Learning for Movie Understanding

Fanyi Xiao, Kaustav Kundu, Joseph Tighe, Davide Modolo; Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 9727-9736

Abstract


Most self-supervised video representation learning approaches focus on action recognition. In contrast, in this paper we focus on self-supervised video learning for movie understanding and propose a novel hierarchical self-supervised pretraining strategy that separately pretrains each level of our hierarchical movie understanding model. Specifically, we propose to pretrain the low-level video backbone with a contrastive learning objective and the higher-level video contextualizer with an event mask prediction task, which enables the use of different data sources for pretraining different levels of the hierarchy. We first show that our self-supervised pretraining strategies are effective and lead to improved performance on all tasks and metrics of the VidSitu benchmark (e.g., improving semantic role prediction CIDEr from 47% to 61%). We further demonstrate the effectiveness of our contextualized event features on LVU tasks, both when used alone and when combined with instance features, showing their complementarity.
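
The abstract only names the two pretraining objectives, so the PyTorch sketch below illustrates one possible setup: a contrastive (InfoNCE-style) loss on a low-level clip backbone, and an event mask prediction loss on a higher-level transformer contextualizer. All module names, feature dimensions, the masking ratio, and the loss details are illustrative assumptions, not the paper's actual implementation.

# Minimal sketch of the two-level pretraining idea; names and details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class VideoBackbone(nn.Module):
    """Stand-in for a low-level clip encoder (e.g., a 3D CNN or video transformer)."""

    def __init__(self, feat_dim=512):
        super().__init__()
        # Placeholder: flatten raw clip pixels and project to feat_dim.
        self.proj = nn.Linear(3 * 8 * 32 * 32, feat_dim)

    def forward(self, clips):  # clips: (B, 3, T, H, W)
        return self.proj(clips.flatten(1))


def info_nce_loss(z1, z2, temperature=0.1):
    """Contrastive objective: two augmented views of the same clip are positives."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)


class EventContextualizer(nn.Module):
    """Higher-level transformer that contextualizes per-event features."""

    def __init__(self, feat_dim=512, num_layers=2, num_heads=8):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(feat_dim))
        layer = nn.TransformerEncoderLayer(feat_dim, num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(feat_dim, feat_dim)

    def forward(self, event_feats, mask):  # event_feats: (B, N, D), mask: (B, N) bool
        x = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(event_feats), event_feats)
        return self.head(self.encoder(x))


def masked_event_loss(contextualizer, event_feats, mask_ratio=0.3):
    """Event mask prediction: reconstruct masked event features from their context."""
    mask = torch.rand(event_feats.shape[:2], device=event_feats.device) < mask_ratio
    pred = F.normalize(contextualizer(event_feats, mask), dim=-1)
    target = F.normalize(event_feats, dim=-1)
    # Penalize only the masked positions (cosine-style regression).
    return (1 - (pred * target).sum(-1))[mask].mean()


if __name__ == "__main__":
    # Stage 1: contrastive pretraining of the low-level backbone on clip view pairs.
    backbone = VideoBackbone()
    view1 = torch.randn(4, 3, 8, 32, 32)   # two augmented views of the same clips
    view2 = torch.randn(4, 3, 8, 32, 32)
    loss_backbone = info_nce_loss(backbone(view1), backbone(view2))

    # Stage 2: mask prediction pretraining of the contextualizer over event features
    # extracted with the backbone, possibly from a different data source.
    contextualizer = EventContextualizer()
    events = torch.randn(4, 16, 512)        # (batch, events per movie, feature dim)
    loss_context = masked_event_loss(contextualizer, events)
    print(loss_backbone.item(), loss_context.item())

In this sketch the two stages use separate data loaders and optimizers, which is what allows different data sources to pretrain different levels of the hierarchy.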

Related Material


@InProceedings{Xiao_2022_CVPR,
    author    = {Xiao, Fanyi and Kundu, Kaustav and Tighe, Joseph and Modolo, Davide},
    title     = {Hierarchical Self-Supervised Representation Learning for Movie Understanding},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    month     = {June},
    year      = {2022},
    pages     = {9727-9736}
}