Recognizing American Sign Language Gestures From Within Continuous Videos

Yuancheng Ye, Yingli Tian, Matt Huenerfauth, Jingya Liu; The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2018, pp. 2064-2073

Abstract


In this paper, we propose a novel hybrid model, 3D recurrent convolutional neural networks (3DRCNN), to recognize American Sign Language (ASL) gestures and localize their temporal boundaries within continuous videos by fusing multi-modality features. Our proposed 3DRCNN model integrates a 3D convolutional neural network (3DCNN) and an enhanced fully connected recurrent neural network (FC-RNN), where the 3DCNN learns multi-modality features from RGB, motion, and depth channels, and the FC-RNN captures the temporal information across short video clips divided from the original video. Consecutive clips with the same semantic meaning are then identified by applying a sliding-window approach over the entire video sequence. To evaluate our method, we collected a new ASL dataset that contains two types of videos: Sequence videos (in which a human performs a list of specific ASL words) and Sentence videos (in which a human performs ASL sentences containing multiple ASL words). The dataset is fully annotated for each semantic region (i.e., the time duration of each word that the human signer performs) and contains multiple input channels. Our proposed method achieves 69.2% accuracy on the Sequence videos for 27 ASL words, which demonstrates its effectiveness at detecting ASL gestures from continuous videos.
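The sliding-window step described above can be sketched as follows: the frame sequence is split into fixed-length, overlapping clips, each clip is assigned a label (here supplied externally, standing in for the 3DRCNN classifier), and consecutive clips sharing a label are merged into one temporal segment. The clip length, stride, and helper names below are illustrative assumptions, not values from the paper.

```python
# Hypothetical sketch of sliding-window clip segmentation and merging.
# Clip length (16 frames) and stride (8 frames) are assumed values.

def make_clips(num_frames, clip_len=16, stride=8):
    """Return (start, end) frame ranges for overlapping sliding-window clips."""
    return [(s, s + clip_len)
            for s in range(0, num_frames - clip_len + 1, stride)]

def merge_segments(clip_labels, clips):
    """Merge consecutive clips that share a label into (start, end, label) segments."""
    segments = []
    for (start, end), label in zip(clips, clip_labels):
        if segments and segments[-1][2] == label:
            # Same semantic meaning as the previous clip: extend that segment.
            prev_start, _, _ = segments[-1]
            segments[-1] = (prev_start, end, label)
        else:
            segments.append((start, end, label))
    return segments

# Example: a 48-frame video whose clips were labeled by some classifier.
clips = make_clips(48)                       # 5 overlapping clips
labels = ["hello", "hello", "thanks", "thanks", "thanks"]
segments = merge_segments(labels, clips)     # two merged word segments
```

In this toy run, the five overlapping clips collapse into one segment per word, giving the temporal boundaries of each recognized gesture.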

Related Material


[pdf]
[bibtex]
@InProceedings{Ye_2018_CVPR_Workshops,
author = {Ye, Yuancheng and Tian, Yingli and Huenerfauth, Matt and Liu, Jingya},
title = {Recognizing American Sign Language Gestures From Within Continuous Videos},
booktitle = {The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops},
month = {June},
year = {2018}
}