Contrastive Losses Are Natural Criteria for Unsupervised Video Summarization

Zongshang Pang, Yuta Nakashima, Mayu Otani, Hajime Nagahara; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2023, pp. 2010-2019

Abstract


Video summarization aims to select the most informative subset of frames in a video to facilitate efficient video browsing. Unsupervised methods typically rely on heuristic training objectives such as diversity and representativeness. However, such methods must bootstrap the online-generated summaries to compute those objectives for importance-score regression. We consider this pipeline inefficient and instead seek to directly quantify frame-level importance with the help of contrastive losses from the representation learning literature. Leveraging the contrastive losses, we propose three metrics that characterize a desirable key frame: local dissimilarity, global consistency, and uniqueness. With features pre-trained on an image classification task, these metrics can already yield high-quality importance scores, demonstrating better or competitive performance compared with previous heavily trained methods. We show that refining the pre-trained features with contrastive learning further improves the frame-level importance scores, and that the model can learn from random videos and generalize to test videos with decent performance.
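The abstract names three metrics but does not define them; one plausible reading, sketched below purely for illustration, scores each frame from pairwise cosine similarities of its pre-trained features. Here "local dissimilarity" is taken as dissimilarity to temporal neighbours, "global consistency" as similarity to the mean video feature, and "uniqueness" as dissimilarity to the most similar other frame. The function name `importance_scores` and these exact formulations are assumptions, not the paper's definitions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Unit-normalize rows so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def importance_scores(feats, window=5):
    """Hypothetical sketch of the three metrics named in the abstract.

    feats: (T, D) per-frame features, e.g. from an ImageNet-pretrained CNN.
    Returns (local_dissimilarity, global_consistency, uniqueness), each (T,).
    """
    f = l2_normalize(np.asarray(feats, dtype=np.float64))
    T = f.shape[0]
    sim = f @ f.T  # (T, T) cosine similarity matrix

    # Local dissimilarity: 1 - mean similarity to frames in a temporal window.
    local = np.empty(T)
    for t in range(T):
        lo, hi = max(0, t - window), min(T, t + window + 1)
        idx = [i for i in range(lo, hi) if i != t]
        local[t] = 1.0 - sim[t, idx].mean()

    # Global consistency: similarity to the mean (video-level) feature.
    g = l2_normalize(f.mean(axis=0, keepdims=True))
    global_cons = (f @ g.T).ravel()

    # Uniqueness: 1 - similarity to the closest *other* frame.
    sim_no_self = sim.copy()
    np.fill_diagonal(sim_no_self, -np.inf)
    uniqueness = 1.0 - sim_no_self.max(axis=1)

    return local, global_cons, uniqueness
```

Under this reading, a single frame-level importance score could be a (possibly weighted) sum of the three metrics; how the paper actually combines them is not stated in the abstract.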

Related Material


@InProceedings{Pang_2023_WACV,
    author    = {Pang, Zongshang and Nakashima, Yuta and Otani, Mayu and Nagahara, Hajime},
    title     = {Contrastive Losses Are Natural Criteria for Unsupervised Video Summarization},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
    month     = {January},
    year      = {2023},
    pages     = {2010-2019}
}