Beyond Short Snippets: Deep Networks for Video Classification

Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, George Toderici; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4694-4702


Convolutional neural networks (CNNs) have been exten- sively applied for image recognition problems giving state- of-the-art results on recognition, detection, segmentation and retrieval. In this work we propose and evaluate several deep neural network architectures to combine image infor- mation across a video over longer time periods than previ- ously attempted. We propose two methods capable of han- dling full length videos. The first method explores various convolutional temporal feature pooling architectures, ex- amining the various design choices which need to be made when adapting a CNN for this task. The second proposed method explicitly models the video as an ordered sequence of frames. For this purpose we employ a recurrent neural network that uses Long Short-Term Memory (LSTM) cells which are connected to the output of the underlying CNN. Our best networks exhibit significant performance improve- ments over previously published results on the Sports 1 mil- lion dataset (73.1% vs. 60.9%) and the UCF-101 datasets with (88.2% vs. 87.9%) and without additional optical flow information (82.6% vs. 72.8%).

Related Material

author = {Yue-Hei Ng, Joe and Hausknecht, Matthew and Vijayanarasimhan, Sudheendra and Vinyals, Oriol and Monga, Rajat and Toderici, George},
title = {Beyond Short Snippets: Deep Networks for Video Classification},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2015}