@InProceedings{Papadimitriou_2025_ICCV,
  author    = {Papadimitriou, Katerina and Filntisis, Panagiotis and Retsinas, George and Potamianos, Gerasimos and Maragos, Petros},
  title     = {Seeing in 2D, Thinking in 3D: 3D Hand Mesh-Guided Feature Learning for Continuous Fingerspelling},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
  month     = {October},
  year      = {2025},
  pages     = {6735-6744}
}
Seeing in 2D, Thinking in 3D: 3D Hand Mesh-Guided Feature Learning for Continuous Fingerspelling
Abstract
Recognizing continuous fingerspelling from monocular RGB video is a highly challenging task due to complex hand articulation, coarticulation effects, and significant inter-signer variability. Prior methods use either raw visual features, which lack structural awareness of fine-grained finger dynamics, or parallel RGB-pose streams from explicit pose estimation, which add substantial inference-time overhead. In this work, we propose a novel knowledge distillation framework that transfers rich hand articulation knowledge from HAMER, a foundation model for 3D hand mesh/pose reconstruction, into a lightweight, RGB-only fingerspelling recognizer. We extract high-level pose embeddings from HAMER's Transformer head, which encode detailed hand structure, and distill them into a ResNet34-based appearance encoder via a dedicated training objective. The learned pose-aware features are then fed into a 1D-CNN and BiGRU for temporal modeling, and the full system is trained jointly with a connectionist temporal classification (CTC) loss and a knowledge distillation loss. Notably, our approach does not rely on the teacher model (HAMER) at inference time, enabling real-time performance. We evaluate our method on two American Sign Language (ASL) fingerspelling benchmark datasets, as well as a studio-quality Greek fingerspelling corpus. Our model achieves state-of-the-art accuracy with over 3x lower inference time than prior methods, offering an effective trade-off between accuracy and efficiency for real-time deployment.
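The abstract describes training the student with a CTC recognition loss plus a feature-distillation term against frozen HAMER pose embeddings. The following is a minimal NumPy sketch of how such a combined objective could look; the cosine-based distillation form, the function names, and the weighting scalar `lam` are assumptions for illustration (the paper only states that a dedicated distillation objective is used alongside CTC), not the authors' exact formulation.

```python
import numpy as np

def cosine_distill_loss(student_feats, teacher_feats):
    """Mean (1 - cosine similarity) between per-frame student features
    (assumed already projected to the teacher's dimension) and frozen
    HAMER pose embeddings. Both arrays have shape (T, D).

    NOTE: illustrative assumption; the paper does not specify the exact
    distillation loss form."""
    s = student_feats / np.linalg.norm(student_feats, axis=1, keepdims=True)
    t = teacher_feats / np.linalg.norm(teacher_feats, axis=1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=1)))

def total_loss(ctc_loss, student_feats, teacher_feats, lam=1.0):
    """Combined training objective: CTC recognition loss plus a weighted
    distillation term. `lam` is a hypothetical balancing weight."""
    return ctc_loss + lam * cosine_distill_loss(student_feats, teacher_feats)
```

Because the teacher embeddings appear only in the training loss, the student can discard HAMER at inference, which is what lets the RGB-only recognizer run in real time.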