Fingerspelling PoseNet: Enhancing Fingerspelling Translation With Pose-Based Transformer Models

Pooya Fayyazsanavi, Negar Nejatishahidin, Jana Košecká; Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops, 2024, pp. 1120-1130

Abstract


We address the task of American Sign Language fingerspelling translation using videos in the wild. We exploit advances in accurate hand pose estimation and propose a novel architecture that leverages a transformer-based encoder-decoder model, enabling seamless contextual word translation. The translation model is augmented by a novel loss term that accurately predicts the length of the fingerspelled word, benefiting both training and inference. We also propose a novel two-stage inference approach that re-ranks the hypotheses using the language model capabilities of the decoder. Through extensive experiments, we demonstrate that our proposed method outperforms the state-of-the-art models on ChicagoFSWild and ChicagoFSWild+, achieving more than 10% relative improvement in performance. Our findings highlight the effectiveness of our approach and its potential to advance fingerspelling recognition in sign language translation.
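The abstract does not specify how the second inference stage scores hypotheses, so the following is only a minimal sketch of the general idea of re-ranking an n-best list. It assumes each first-pass hypothesis carries a log-probability, that a predicted word length is available, and that a language-model scorer can be called on a candidate string. The names `rerank`, `lm_score`, `alpha`, and `beta`, the linear score combination, and the toy scorer are all hypothetical illustrations, not the authors' method.

```python
def rerank(hypotheses, predicted_length, lm_score, alpha=0.5, beta=0.5):
    """Re-rank first-pass hypotheses (hypothetical scoring rule).

    hypotheses: list of (text, first_pass_logprob) pairs.
    predicted_length: word length estimated by the model (assumed given).
    lm_score: callable returning a log-score for a candidate string.
    alpha, beta: illustrative weights, not values from the paper.
    """
    def combined(hyp):
        text, logprob = hyp
        # Penalize candidates whose length disagrees with the prediction.
        length_penalty = abs(len(text) - predicted_length)
        return logprob + alpha * lm_score(text) - beta * length_penalty

    # Highest combined score first.
    return sorted(hypotheses, key=combined, reverse=True)


# Toy stand-in for a language model: known words score higher.
LEXICON = {"hello"}

def toy_lm_score(text):
    return 0.0 if text in LEXICON else -2.0


candidates = [("hellp", -1.0), ("hello", -1.2), ("helo", -1.1)]
ranked = rerank(candidates, predicted_length=5, lm_score=toy_lm_score)
best_text = ranked[0][0]
```

In this toy run the first pass prefers "hellp", while the combined score moves "hello" to the top because it matches both the predicted length and the stand-in lexicon.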

Related Material


[bibtex]
@InProceedings{Fayyazsanavi_2024_WACV,
    author    = {Fayyazsanavi, Pooya and Nejatishahidin, Negar and Ko\v{s}eck\'a, Jana},
    title     = {Fingerspelling PoseNet: Enhancing Fingerspelling Translation With Pose-Based Transformer Models},
    booktitle = {Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV) Workshops},
    month     = {January},
    year      = {2024},
    pages     = {1120-1130}
}