Regularizing Long Short Term Memory With 3D Human-Skeleton Sequences for Action Recognition

Behrooz Mahasseni, Sinisa Todorovic; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 3054-3062

Abstract


This paper argues that large-scale action recognition in video can be greatly improved by providing an additional modality in training data -- namely, 3D human-skeleton sequences -- aimed at complementing poorly represented or missing features of human actions in the training videos. For recognition, we use Long Short Term Memory (LSTM) grounded via a deep Convolutional Neural Network (CNN) onto the video. Training of LSTM is regularized using the output of another encoder LSTM (eLSTM) grounded on 3D human-skeleton training data. For such regularized training of LSTM, we modify the standard backpropagation through time (BPTT) in order to address the well-known issues with gradient descent in constraint optimization. Our evaluation on three benchmark datasets -- Sports-1M, HMDB-51, and UCF101 -- shows accuracy improvements from 5.3% up to 17.4% relative to the state of the art.

Related Material


[pdf] [video]
[bibtex]
@InProceedings{Mahasseni_2016_CVPR,
author = {Mahasseni, Behrooz and Todorovic, Sinisa},
title = {Regularizing Long Short Term Memory With 3D Human-Skeleton Sequences for Action Recognition},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2016}
}