@InProceedings{Fahim_2024_CVPR,
    author    = {Fahim, Masud An-Nur Islam and Boutellier, Jani},
    title     = {CheckMATE: Efficient Video Summarization by Checking Mutually Averaged Temporal Encapsulation},
    booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
    month     = {June},
    year      = {2024},
    pages     = {8343-8348}
}
CheckMATE: Efficient Video Summarization by Checking Mutually Averaged Temporal Encapsulation
Abstract
Video classification is computationally demanding at inference time, and especially so at training time. The computational burden originates both from the number of training sequences needed and from the high-volume data content of each sequence. On the model architecture side, video recognition is dominated by 3D ConvNets, which are computationally much more demanding than their 2D counterparts. This paper proposes a simple yet efficient solution for large-scale learning from videos: the entire video clip is summarized into a single frame that offers visual recognition performance comparable to the original video stream. The proposed video summarization algorithm distills the input video into a single frame in the feature space of the image classifier. After compressing the videos into individual frames, regular image classification training is performed for the purpose of action recognition. We validate the performance of our approach on the UCF101 and HMDB51 datasets and observe results comparable to competing approaches that leverage expensive 3D ConvNets. In contrast, our approach uses only 2D image classification networks and does not require any pre-training on video datasets.
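To make the idea of collapsing a clip into a single frame concrete, the following is a minimal sketch of the simplest possible baseline: per-pixel temporal averaging. This is not the CheckMATE algorithm, which according to the abstract distills the video in the feature space of the image classifier; the function name `average_frames`, the `Frame` type alias, and the toy two-frame clip are all assumptions made for illustration. Plain Python lists are used to keep the sketch dependency-free.

```python
from typing import List

# Hypothetical representation: one grayscale frame as rows of pixel intensities.
Frame = List[List[float]]


def average_frames(clip: List[Frame]) -> Frame:
    """Collapse a clip into one frame by per-pixel temporal averaging.

    Naive pixel-space baseline for illustration only; it is NOT the
    feature-space summarization proposed in the paper.
    """
    if not clip:
        raise ValueError("clip must contain at least one frame")
    height, width = len(clip[0]), len(clip[0][0])
    # Every frame must have the same height and width as the first one.
    for frame in clip:
        if len(frame) != height or any(len(row) != width for row in frame):
            raise ValueError("all frames must share the same shape")
    n = len(clip)
    # Average each pixel position across all frames in the clip.
    return [
        [sum(frame[r][c] for frame in clip) / n for c in range(width)]
        for r in range(height)
    ]


# Toy clip: two 2x2 frames (assumed data, not from any dataset).
clip = [
    [[0.0, 2.0], [4.0, 6.0]],
    [[2.0, 4.0], [6.0, 8.0]],
]
summary = average_frames(clip)
```

The resulting `summary` has the shape of a single frame, so it could in principle be fed to an ordinary 2D image classifier in place of the full clip, which is the training setup the abstract describes.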