A Multi-scale Approach to Gesture Detection and Recognition

Natalia Neverova, Christian Wolf, Giulio Paci, Giacomo Sommavilla, Graham W. Taylor, Florian Nebout; Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops, 2013, pp. 484-491


We propose a generalized approach to human gesture recognition based on multiple data modalities such as depth video, articulated pose and speech. In our system, each gesture is decomposed into large-scale body motion and local subtle movements such as hand articulation. The idea of learning at multiple scales is also applied to the temporal dimension, such that a gesture is considered as a set of characteristic motion impulses, or dynamic poses. Each modality is first processed separately in short spatio-temporal blocks, where discriminative data-specific features are either manually extracted or learned. Finally, we employ a Recurrent Neural Network for modeling large-scale temporal dependencies, data fusion and ultimately gesture classification. Our experiments on the 2013 Challenge on Multimodal Gesture Recognition dataset have demonstrated that using multiple modalities at several spatial and temporal scales leads to a significant increase in performance allowing the model to compensate for errors of individual classifiers as well as noise in the separate channels.

Related Material

author = {Natalia Neverova and Christian Wolf and Giulio Paci and Giacomo Sommavilla and Graham W. Taylor and Florian Nebout},
title = {A Multi-scale Approach to Gesture Detection and Recognition},
booktitle = {Proceedings of the IEEE International Conference on Computer Vision (ICCV) Workshops},
month = {June},
year = {2013}