Structured Images for RGB-D Action Recognition

Pichao Wang, Shuang Wang, Zhimin Gao, Yonghong Hou, Wanqing Li; Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops, 2017, pp. 1005-1014

Abstract


This paper presents a simple yet effective video representation for RGB-D based action recognition. It represents a depth map sequence as three pairs of structured dynamic images, at the body, part and joint levels respectively, constructed through bidirectional rank pooling. Unlike previous work that applied one Convolutional Neural Network (ConvNet) to each part or joint separately, one pair of structured dynamic images is constructed from the depth maps at each granularity level and serves as the input to a ConvNet. The structured dynamic image not only preserves spatial-temporal information but also enhances the structure information across body parts/joints and across different temporal scales. In addition, it requires little computation and memory to construct. The proposed representation is evaluated on five benchmark datasets, namely MSRAction3D, G3D, MSRDailyActivity3D, SYSU 3D HOI and UTD-MHAD, and achieves state-of-the-art results on all five.
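
To make the idea of a pair of dynamic images concrete, the following is a minimal sketch and not the authors' implementation. It assumes that exact rank pooling is replaced by the closed-form approximate rank pooling weights used in the dynamic-image literature, and that the input is a plain (T, H, W) array of depth frames. The function names `approx_rank_pool_weights`, `approx_rank_pool` and `bidirectional_dynamic_images` are hypothetical. The body, part and joint level structuring described in the abstract is not modeled here; the same pooling would be applied to whatever frame sequence each level produces.

```python
import numpy as np


def approx_rank_pool_weights(num_frames):
    """Closed-form weights alpha_t = 2(T - t + 1) - (T + 1)(H_T - H_{t-1}).

    H_k is the k-th harmonic number with H_0 = 0. This is an assumed
    approximation of rank pooling, not the exact optimization.
    """
    T = num_frames
    t = np.arange(1, T + 1)
    # harmonic[k] holds H_k for k = 0..T
    harmonic = np.concatenate(([0.0], np.cumsum(1.0 / t)))
    return 2.0 * (T - t + 1) - (T + 1) * (harmonic[T] - harmonic[t - 1])


def approx_rank_pool(frames):
    """Collapse a (T, H, W) sequence into one (H, W) image."""
    frames = np.asarray(frames, dtype=np.float64)
    alpha = approx_rank_pool_weights(frames.shape[0])
    # Weighted sum over the time axis
    return np.tensordot(alpha, frames, axes=(0, 0))


def bidirectional_dynamic_images(frames):
    """Return the (forward, backward) pair for one frame sequence."""
    frames = np.asarray(frames, dtype=np.float64)
    forward = approx_rank_pool(frames)
    # Backward pass: same pooling applied to the time-reversed sequence
    backward = approx_rank_pool(frames[::-1])
    return forward, backward
```

For example, `bidirectional_dynamic_images(depth_clip)` with `depth_clip` of shape (T, H, W) yields two (H, W) arrays. In practice each would typically be rescaled to an image range before being fed to a ConvNet; that step is left out of this sketch.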

Related Material


[bibtex]
@InProceedings{Wang_2017_ICCV,
author = {Wang, Pichao and Wang, Shuang and Gao, Zhimin and Hou, Yonghong and Li, Wanqing},
title = {Structured Images for RGB-D Action Recognition},
booktitle = {Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops},
month = {Oct},
year = {2017}
}