A Key Volume Mining Deep Framework for Action Recognition

Wangjiang Zhu, Jie Hu, Gang Sun, Xudong Cao, Yu Qiao; The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1991-1999


Recently, deep learning approaches have demonstrated remarkable progress in action recognition in videos. Most existing deep frameworks treat every volume, i.e., spatial-temporal video clip, equally and directly assign the video label to all volumes sampled from it. However, within a video, discriminative actions may occur sparsely in a few key volumes, and most other volumes are irrelevant to the labeled action category. Training with a large proportion of irrelevant volumes hurts performance. To address this issue, we propose a key volume mining deep framework that identifies key volumes and performs classification simultaneously. Specifically, our framework is trained end-to-end in an EM-like loop. In the forward pass, our network mines key volumes for each action class. In the backward pass, it updates the network parameters with the help of these mined key volumes. In addition, we propose "Stochastic out" to handle key volumes from multiple modalities, and a simple yet effective "unsupervised key volume proposal" method for high-quality volume sampling. Our experiments show that action recognition performance can be significantly improved by mining key volumes, and our methods achieve state-of-the-art performance on UCF101 (93.1%).
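
The EM-like loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the linear scorer stands in for the deep network, and the mining rule (take the volume with the highest score for the video's label) is an assumption, since the abstract does not specify how key volumes are selected. All function names and the toy features are hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def class_scores(weights, volume):
    """Linear stand-in for a deep network: one score per class for one volume."""
    return [sum(w * x for w, x in zip(row, volume)) for row in weights]

def mine_key_volume(weights, volumes, label):
    """Forward step (assumed rule): index of the volume scoring highest for the label."""
    return max(range(len(volumes)),
               key=lambda i: class_scores(weights, volumes[i])[label])

def em_like_step(weights, volumes, label, lr=0.5):
    """Mine one key volume, then take one SGD step on that volume only."""
    key = mine_key_volume(weights, volumes, label)
    x = volumes[key]
    probs = softmax(class_scores(weights, x))
    for c, row in enumerate(weights):
        # Gradient of softmax cross-entropy with respect to the score of class c.
        grad = probs[c] - (1.0 if c == label else 0.0)
        for d in range(len(row)):
            row[d] -= lr * grad * x[d]
    return key

# Toy video with label 0; features are [class-0 evidence, class-1 evidence, background].
weights = [[0.2, 0.0, 0.0], [0.0, 0.2, 0.0]]
video = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
before = softmax(class_scores(weights, video[1]))[0]
key = em_like_step(weights, video, label=0)
after = softmax(class_scores(weights, video[1]))[0]
print(key, before < after)  # prints "1 True"
```

In this toy case the two background volumes are never selected, so their weights receive no gradient, which is the effect the abstract attributes to training on mined key volumes rather than on every sampled volume.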

Related Material

author = {Zhu, Wangjiang and Hu, Jie and Sun, Gang and Cao, Xudong and Qiao, Yu},
title = {A Key Volume Mining Deep Framework for Action Recognition},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2016}