Multi-Annotation Attention Model for Video Summarization
Over the last decade, the supply of online video content has exploded, making automatic video summarization necessary so that consumers can quickly glance at a video's content. However, the notion of a video summary is subjective and thus requires multiple annotators to define the ground truth. Existing video summarization techniques are limited in several ways. First, they aggregate multiple annotations by simple averaging and use these estimates to train a learning model that makes predictions on unseen videos. Second, they rely on RNN-based architectures, which struggle to model long-range dependencies. Third, the amount of annotated data available for general video summarization is too small to train visual models from scratch. To mitigate these issues, this work proposes a new end-to-end probabilistic framework called the Multi-Annotation Attention Model (MAAM), optimized with the Expectation-Maximization algorithm, where the true label is treated as a latent variable. The MAAM framework has several advantages: (i) it exploits multiple annotations from different human labelers and thus combines model training with label aggregation, (ii) it models the temporal dynamics of videos through an attention mechanism, and (iii) it benefits from the power of pretrained visual encoders, namely the Vision Transformer (ViT). The proposed approach is evaluated on two public datasets, TVSum and SumMe, where it significantly outperforms state-of-the-art methods.
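To give a concrete feel for the EM idea sketched in the abstract (treating the true label as a latent variable inferred from several annotators), the following is a minimal illustrative sketch, not the paper's actual model: it assumes a simple Gaussian noise model per annotator and alternates between estimating the latent per-frame importance scores (E-step) and each annotator's noise variance (M-step). The function name and the noise model are assumptions for illustration only.

```python
import numpy as np

def em_label_aggregation(scores, n_iter=20):
    """Aggregate noisy per-frame importance scores from several annotators.

    scores: (n_annotators, n_frames) array of annotations.
    Hypothetical simplification of the framework: each annotator observes
    the latent true score plus zero-mean Gaussian noise with an
    annotator-specific variance.
    """
    n_annot, n_frames = scores.shape
    var = np.ones(n_annot)           # per-annotator noise variances
    latent = scores.mean(axis=0)     # initialize latent scores by averaging
    for _ in range(n_iter):
        # E-step: posterior mean of latent scores = precision-weighted average
        w = 1.0 / var
        latent = (w[:, None] * scores).sum(axis=0) / w.sum()
        # M-step: re-estimate each annotator's noise variance
        var = ((scores - latent[None, :]) ** 2).mean(axis=1)
        var = np.maximum(var, 1e-8)  # guard against zero variance
    return latent, var
```

Unlike plain averaging, this scheme automatically downweights unreliable annotators, which is the core benefit the abstract attributes to combining model training with label aggregation.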