CLIP-SLA: Parameter-Efficient CLIP Adaptation for Continuous Sign Language Recognition

Sarah Alyami, Hamzah Luqman; Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops, 2025, pp. 4098-4108

Abstract


Continuous sign language recognition (CSLR) focuses on interpreting and transcribing sequences of sign language gestures in videos. In this work, we propose CLIP sign language adaptation (CLIP-SLA), a novel CSLR framework that adapts the powerful pre-trained visual encoder of the CLIP model to sign language tasks through parameter-efficient fine-tuning (PEFT). We introduce two variants, SLA-Adapter and SLA-LoRA, which integrate PEFT modules into the CLIP visual encoder, enabling fine-tuning with minimal trainable parameters. The effectiveness of the proposed framework is validated on four datasets: Phoenix2014, Phoenix2014-T, CSL-Daily, and Isharah-500, where both CLIP-SLA variants outperform several state-of-the-art (SOTA) models with fewer trainable parameters. Extensive ablation studies demonstrate the effectiveness and flexibility of the proposed methods with different vision-language models for CSLR. These findings showcase the potential of adapting large-scale pre-trained models for scalable and efficient CSLR, paving the way for future advancements in sign language understanding. Code is available at https://github.com/snalyami/CLIP-SLA.
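For readers unfamiliar with how LoRA-style PEFT modules can be attached to a frozen encoder, the sketch below illustrates the general idea in PyTorch. It is not the authors' implementation (see the linked repository for that); the class name, rank, and scaling factor are illustrative assumptions.

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen pre-trained linear layer with a trainable low-rank update.

    Illustrative only: rank and alpha are hypothetical defaults, not values
    reported by CLIP-SLA.
    """

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 8.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # keep the pre-trained weights frozen
        # Low-rank factors A (down-projection) and B (up-projection)
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)  # start as a zero (identity) update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen output plus the scaled low-rank correction
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

# Example: wrapping a projection layer of a (hypothetical) frozen encoder
frozen_proj = nn.Linear(768, 768)
adapted_proj = LoRALinear(frozen_proj, rank=4)
out = adapted_proj(torch.randn(2, 197, 768))  # only lora_a/lora_b receive gradients

Only the low-rank factors are trained, which is what keeps the number of trainable parameters small relative to full fine-tuning of the CLIP visual encoder.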

Related Material


[bibtex]
@InProceedings{Alyami_2025_CVPR,
  author    = {Alyami, Sarah and Luqman, Hamzah},
  title     = {CLIP-SLA: Parameter-Efficient CLIP Adaptation for Continuous Sign Language Recognition},
  booktitle = {Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR) Workshops},
  month     = {June},
  year      = {2025},
  pages     = {4098-4108}
}