VLAD3: Encoding Dynamics of Deep Features for Action Recognition

Yingwei Li, Weixin Li, Vijay Mahadevan, Nuno Vasconcelos; The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 1951-1960

Abstract


Previous approaches to action recognition with deep features tend to process video frames only within a small temporal region, and do not model long-range dynamic information explicitly. However, such information is important for the accurate recognition of actions, especially for the discrimination of complex activities that share sub-actions, and when dealing with untrimmed videos. Here, we propose a representation, VLAD for Deep Dynamics (VLAD^3), that accounts for different levels of video dynamics. It captures short-term dynamics with deep convolutional neural network features, and relies on linear dynamic systems (LDS) to model medium-range dynamics. To account for long-range inhomogeneous dynamics, a VLAD descriptor is derived for the LDS and pooled over the whole video to arrive at the final VLAD^3 representation. An extensive evaluation was performed on Olympic Sports, UCF101 and THUMOS15, where the use of the VLAD^3 representation leads to state-of-the-art results.
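To make the pooling step concrete, the following is a minimal sketch of standard VLAD encoding with NumPy: local descriptors are assigned to their nearest codebook center, residuals are accumulated per center, and the result is power- and L2-normalized. This illustrates generic VLAD only; the paper's actual VLAD^3 descriptor is derived from LDS parameters, and the function and variable names here are illustrative, not from the authors' implementation.

```python
import numpy as np

def vlad_encode(descriptors, centers):
    """Generic VLAD encoding (illustrative sketch, not the paper's VLAD^3).

    descriptors: (N, D) array of local features.
    centers:     (K, D) array of codebook centers.
    Returns a (K * D,) normalized VLAD vector.
    """
    # Hard-assign each descriptor to its nearest center.
    dists = np.linalg.norm(descriptors[:, None, :] - centers[None, :, :], axis=2)
    assign = np.argmin(dists, axis=1)

    # Accumulate residuals (descriptor minus center) per codebook cell.
    K, D = centers.shape
    v = np.zeros((K, D))
    for k in range(K):
        mask = assign == k
        if np.any(mask):
            v[k] = (descriptors[mask] - centers[k]).sum(axis=0)

    # Flatten, then apply signed square-root (power) and L2 normalization,
    # as is common practice with VLAD representations.
    v = v.reshape(-1)
    v = np.sign(v) * np.sqrt(np.abs(v))
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

In the paper's pipeline, the descriptors being encoded would be derived from the LDS that models medium-range dynamics, and the resulting vectors are pooled over the whole video.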

Related Material


[bibtex]
@InProceedings{Li_2016_CVPR,
author = {Li, Yingwei and Li, Weixin and Mahadevan, Vijay and Vasconcelos, Nuno},
title = {VLAD3: Encoding Dynamics of Deep Features for Action Recognition},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2016}
}