Deep Local Video Feature for Action Recognition

Zhenzhong Lan, Yi Zhu, Alexander G. Hauptmann, Shawn Newsam; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2017, pp. 1-7


We investigate the problem of representing an entire video using CNN features for human action recognition. End-to-end learning of CNNs/RNNs is currently not feasible for whole videos due to GPU memory limitations, so a common practice is to use sampled frames as inputs, with the video-level labels as supervision. However, the global video labels might not be suitable for all of the temporally local samples, as videos often contain content besides the action of interest. We therefore propose to instead treat the deep networks trained on local inputs as local feature extractors. The local features are then aggregated to form global features, which are used to assign video-level labels through a second classification stage. We investigate a number of design choices for this local feature approach. Experimental results on the HMDB51 and UCF101 datasets show that simple maximum pooling over sparsely sampled local features leads to significant performance improvements.
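As a rough illustration of the aggregation step described above, the sketch below max-pools per-frame CNN features into a single global video descriptor that a second-stage classifier could consume. The function name, feature dimensionality, and frame count are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def aggregate_max_pool(frame_features):
    """Max-pool local (per-frame) features into one global video feature.

    frame_features: array of shape (num_frames, feature_dim), one row per
    sparsely sampled frame. Returns a (feature_dim,) global descriptor that
    keeps, for each dimension, the strongest response across the video.
    """
    return frame_features.max(axis=0)

# Hypothetical example: 25 sparsely sampled frames, 4096-d CNN features each.
rng = np.random.default_rng(0)
local_feats = rng.standard_normal((25, 4096))
global_feat = aggregate_max_pool(local_feats)
assert global_feat.shape == (4096,)
```

The resulting global feature would then be paired with the video-level label to train the second classification stage, sidestepping the mismatch between global labels and temporally local samples.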

Related Material

@InProceedings{Lan_2017_CVPR_Workshops,
  author    = {Lan, Zhenzhong and Zhu, Yi and Hauptmann, Alexander G. and Newsam, Shawn},
  title     = {Deep Local Video Feature for Action Recognition},
  booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
  month     = {July},
  year      = {2017}
}