Neural Sign Language Synthesis: Words Are Our Glosses

Jan Zelinka, Jakub Kanis; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2020, pp. 3395-3403

Abstract


This paper deals with text-to-video sign language synthesis. Instead of producing video directly, we focus on producing skeletal models. Our main goal in this paper was to design the first fully end-to-end automatic sign language synthesis system trained only on freely available data (daily TV broadcasting). Thus, we excluded any manual video annotation. Furthermore, the proposed approach does not rely on any video segmentation. A proposed feed-forward transformer and a recurrent transformer were investigated. To improve the performance of our sequence-to-sequence transformer, soft non-monotonic attention was employed in the training process. The benefit of character-level features was compared with that of word-level features. Besides a novel approach to sign language synthesis, we also present a gradient-descent-based method for improving skeletal model estimation. This improvement not only smooths skeletal models and interpolates missing bones but also creates 3D skeletal models from 2D models. We focused our experiments on a weather forecasting dataset in the Czech Sign Language.
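The abstract mentions a gradient-descent-based refinement of the estimated skeletal models. The sketch below is only an illustration of that general idea, not the authors' actual method: it assumes hypothetical inputs (a (T, J, 2) array of per-frame 2D joint positions and a (T, J) detection mask) and a made-up function name refine_skeleton, and it only covers smoothing and interpolation of missing joints by minimizing a masked data-fit term plus a temporal-smoothness term; the 2D-to-3D lifting described in the paper is omitted.

import numpy as np

def refine_skeleton(observed, mask, n_iters=500, lr=0.05, smooth_weight=5.0):
    # observed: (T, J, 2) noisy 2D joint positions; mask: (T, J), 1 where a joint was detected.
    x = observed.copy()
    m = mask[..., None].astype(float)  # broadcast the mask over the coordinate axis
    for _ in range(n_iters):
        # Gradient of the data term: pull detected joints toward their observations.
        g = m * (x - observed)
        # Gradient of the smoothness term sum_t ||x[t+1] - x[t]||^2; this term also
        # interpolates joints that are missing (mask == 0) from neighboring frames.
        d = x[1:] - x[:-1]
        g[:-1] -= smooth_weight * 2.0 * d
        g[1:] += smooth_weight * 2.0 * d
        x -= lr * g
    return x

For example, calling refine_skeleton(observed, mask) on detector output with dropped frames yields a smoothed trajectory in which the undetected joints are filled in by the smoothness term alone.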

Related Material


[pdf] [supp] [video]
[bibtex]
@InProceedings{Zelinka_2020_WACV,
author = {Zelinka, Jan and Kanis, Jakub},
title = {Neural Sign Language Synthesis: Words Are Our Glosses},
booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
month = {March},
year = {2020}
}