Isolated Sign Language Recognition With Multi-Scale Spatial-Temporal Graph Convolutional Networks
Isolated Sign Language Recognition (ISLR) fits naturally within the domain of problems that can be addressed by graph-structured spatial-temporal algorithms. A recent multi-scale spatial-temporal graph convolution operator, MS-G3D, exploits the semantic connectivity among non-neighbor nodes of the graph at flexible temporal scales, yielding improved performance on classical Human Action Recognition datasets. In this work, we present a solution for ISLR that uses a skeleton graph comprising body and finger joints and exploits this specific property of MS-G3D, which appears crucial for capturing the internal relationships among semantically connected but distant nodes in sign language dynamics. To complete the analysis, we compare the results with a 3D-CNN architecture, S3D, already used for SLR, and fuse it with MS-G3D. The performance achieved on the AUTSL dataset shows that MS-G3D alone stands out as a viable technique for ISLR; in fact, the improvement after fusing with a 3D-CNN approach, at least on this medium-scale dataset, appears marginal. The transfer learning capability of the trained models is also explored, using pre-training on the larger WLASL dataset and post-training on the smaller LSE UVIGO dataset. The classification performance of the MS-G3D model on AUTSL does not benefit from pre-training with WLASL, but performance on the more similarly acquired LSE UVIGO dataset improves significantly from fine-tuning the MS-G3D AUTSL model.
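The key property the abstract attributes to MS-G3D — direct aggregation between semantically related but topologically distant joints — comes from its disentangled multi-scale adjacency, where scale k connects exactly the joint pairs that are k hops apart, each scale with its own weights. The following is a minimal NumPy sketch of that spatial aggregation on a hypothetical 5-joint toy skeleton (the joint indexing, channel sizes, and weight initialization here are illustrative assumptions, not the paper's actual configuration, which also includes the temporal G3D windows):

```python
import numpy as np

# Hypothetical 5-joint mini-skeleton: 0-1-2 is a body chain,
# 1-3-4 a finger chain sharing joint 1 (illustrative, not the paper's layout).
edges = [(0, 1), (1, 2), (1, 3), (3, 4)]
V = 5
A = np.zeros((V, V))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

def shortest_path_dists(A):
    """Hop distance between every joint pair, via adjacency powers."""
    V = A.shape[0]
    dist = np.where(np.eye(V, dtype=bool), 0.0, np.inf)
    P = np.eye(V)
    for k in range(1, V):
        P = P @ A
        dist[(P > 0) & np.isinf(dist)] = k  # first power reaching (i, j)
    return dist

def disentangled_scales(A, num_scales):
    """[A_k]_ij = 1 iff joints i, j are exactly k hops apart
    (plus self-loops), i.e. MS-G3D-style disentangled aggregation."""
    dist = shortest_path_dists(A)
    scales = []
    for k in range(num_scales):
        Ak = (dist == k).astype(float) + np.eye(A.shape[0])
        Ak /= Ak.sum(axis=1, keepdims=True)  # simple row normalization
        scales.append(Ak)
    return scales

rng = np.random.default_rng(0)
C_in, C_out, K = 3, 8, 3               # e.g. 3-D joint coordinates in
X = rng.standard_normal((V, C_in))     # one frame, V joints
Ws = rng.standard_normal((K, C_in, C_out))

# Multi-scale spatial graph convolution: each hop distance gets its own
# weight matrix, so distant joints contribute without being entangled
# with immediate neighbors (unlike summed adjacency powers).
out = sum(Ak @ X @ Ws[k] for k, Ak in enumerate(disentangled_scales(A, K)))
print(out.shape)  # (5, 8)
```

At scale 2, joints 0 and 2 (non-neighbors, two hops apart through joint 1) exchange information directly, which is the mechanism the abstract points to for relating, say, a wrist to fingertips within one layer.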