Point-Supervised Japanese Fingerspelling Localization via HR-Pro and Contrastive Learning

Ryota Murai, Naoto Tsuta, Duk Shin, Yousun Kang; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, 2025, pp. 4975-4982

Abstract


Japanese fingerspelling recognition faces unique challenges due to rapid temporal transitions and subtle hand-shape variations that are difficult to capture with conventional methods. We address these challenges by applying Hierarchical Reliability Propagation (HR-Pro), a point-supervised Temporal Action Localization (TAL) method that requires only sparse point-level annotations instead of dense frame-wise labels. We enhance the HR-Pro framework through three key innovations: (1) replacing the I3D encoder with a pretrained VideoMAE v2 for superior temporal dynamics modeling without optical flow, (2) introducing point-supervised contrastive learning inspired by SimCLR for robust feature discrimination, and (3) incorporating 20-dimensional joint angle features from MediaPipe for explicit kinematic modeling. Although the dataset used in this work has previously been used internally, we formally introduce it here as a public resource featuring continuous and isolated fingerspelling sequences with diverse phonetic coverage and signer variation. Our best model, a pretrained VideoMAE v2 enhanced with our point-supervised contrastive learning, achieves up to 93.4% mAP at tIoU thresholds of 0.1-0.5, outperforming I3D-based baselines while reducing computational complexity. Notably, adding joint angle features to I3D yields an improvement of more than 30 percentage points, demonstrating the value of kinematic cues. Ablation studies reveal redundancy between the angle features and contrastive pretraining, highlighting the importance of modality-aware fusion strategies. Code is available at: https://github.com/tpu-kanglabs/ub-hrpro
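
For readers unfamiliar with the tIoU criterion behind the reported mAP, the following is a minimal sketch of the standard temporal IoU between a predicted and a ground-truth segment. It is not taken from the authors' code: the function names `temporal_iou` and `matches_at`, and the 0.1 step between thresholds, are assumptions made for illustration.

```python
# Illustrative sketch only; names and the 0.1 threshold step are assumptions.

def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments in the same unit."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap length is zero when the segments do not intersect.
    inter = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - inter
    return inter / union if union > 0 else 0.0


def matches_at(pred, gt, thresholds):
    """For each threshold, report whether the prediction counts as a match."""
    iou = temporal_iou(pred, gt)
    return {t: iou >= t for t in thresholds}


# Assumed threshold grid covering the 0.1-0.5 range quoted in the abstract.
THRESHOLDS = [0.1, 0.2, 0.3, 0.4, 0.5]

if __name__ == "__main__":
    print(matches_at((0.0, 10.0), (5.0, 15.0), THRESHOLDS))
```

A prediction that overlaps the ground truth only loosely still counts as correct at the low thresholds, which is why results over this range are more forgiving of boundary errors than results at higher tIoU values.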

Related Material


[bibtex]
@InProceedings{Murai_2025_ICCV,
    author    = {Murai, Ryota and Tsuta, Naoto and Shin, Duk and Kang, Yousun},
    title     = {Point-Supervised Japanese Fingerspelling Localization via HR-Pro and Contrastive Learning},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops},
    month     = {October},
    year      = {2025},
    pages     = {4975-4982}
}