Temporal U-Nets for Video Summarization with Scene and Action Recognition

Heeseung Kwon, Woohyun Shim, Minsu Cho; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2019

Abstract


While videos contain long-term temporal information with diverse contents, existing approaches to video understanding usually focus on a short trimmed video clip with specific content, such as a particular action or object. For comprehensive understanding of untrimmed videos, we address an integrated video task of video summarization with scene and action recognition. We propose a novel convolutional neural network architecture for handling untrimmed videos with multiple contents. The proposed architecture is an encoder-decoder structure in which the encoder captures long-term temporal dynamics from an entire video and the decoder predicts detailed temporal information about the video's multiple contents. Two-stream processing is adopted for obtaining feature representations: one stream focuses on spatial information and the other on temporal information. We evaluate the proposed method on the benchmark of the Challenge on Comprehensive Video Understanding in the Wild (CoVieW 2019), and the experimental results demonstrate that our method achieves outstanding performance.
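The temporal encoder-decoder idea in the abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation; it is a hypothetical NumPy toy of a 1-D (temporal) U-Net: the encoder downsamples the frame axis to capture longer-range dynamics, and the decoder upsamples back with a skip connection to emit per-frame predictions. All layer sizes, the single encoder/decoder level, and the random weights are illustrative assumptions.

```python
import numpy as np

def conv1d(x, w):
    # Temporal convolution with 'same' padding and ReLU.
    # x: (T, C_in) per-frame features; w: (k, C_in, C_out) kernel.
    T, _ = x.shape
    k, _, c_out = w.shape
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros((T, c_out))
    for t in range(T):
        # Each output frame sees a k-frame temporal window.
        out[t] = np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1]))
    return np.maximum(out, 0.0)

def temporal_unet(frames, rng, num_classes):
    # frames: (T, C) per-frame features, T assumed even; hypothetical sizes.
    T, C = frames.shape
    assert T % 2 == 0, "toy model pools the temporal axis by 2"
    w1 = rng.standard_normal((3, C, 16)) * 0.1
    w2 = rng.standard_normal((3, 16, 32)) * 0.1
    w3 = rng.standard_normal((3, 48, 16)) * 0.1       # 48 = 32 upsampled + 16 skip
    w_out = rng.standard_normal((16, num_classes)) * 0.1

    e1 = conv1d(frames, w1)                           # encoder: (T, 16)
    p1 = e1.reshape(T // 2, 2, 16).max(axis=1)        # temporal max-pool /2: (T/2, 16)
    e2 = conv1d(p1, w2)                               # bottleneck over coarse time: (T/2, 32)
    u1 = np.repeat(e2, 2, axis=0)                     # nearest-neighbor upsample: (T, 32)
    d1 = conv1d(np.concatenate([u1, e1], axis=1), w3) # decoder with skip connection: (T, 16)
    return d1 @ w_out                                 # per-frame class logits: (T, num_classes)
```

Because the decoder restores the original temporal resolution, the output has one score vector per frame, which is what a joint summarization/recognition task needs; the skip connection reinjects the fine-grained per-frame detail that pooling discards. In the paper's two-stream setting one would run such a network on both spatial and temporal feature streams, which this toy omits.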

Related Material


[bibtex]
@InProceedings{Kwon_2019_ICCV,
author = {Kwon, Heeseung and Shim, Woohyun and Cho, Minsu},
title = {Temporal U-Nets for Video Summarization with Scene and Action Recognition},
booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
month = {Oct},
year = {2019}
}