C2ST: Cross-Modal Contextualized Sequence Transduction for Continuous Sign Language Recognition

Huaiwen Zhang, Zihang Guo, Yang Yang, Xin Liu, De Hu; Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 21053-21062

Abstract


Continuous Sign Language Recognition (CSLR) aims to transcribe the signs of an untrimmed video into written words or glosses. The mainstream framework for CSLR consists of a spatial module for visual representation learning, a temporal module that aggregates local and global temporal information across the frame sequence, and the connectionist temporal classification (CTC) loss, which aligns video features with the gloss sequence. Unfortunately, the language prior implicit in the gloss sequence is ignored throughout the modeling process. The contextualization of glosses is also ignored in alignment learning, as CTC makes an independence assumption between glosses. In this paper, we propose Cross-modal Contextualized Sequence Transduction (C2ST) for CSLR, which effectively incorporates knowledge of the gloss sequence into video representation learning and sequence transduction. Specifically, we introduce a cross-modal context learning framework for CSLR, in which the linguistic features of gloss sequences are extracted by a language model and recurrently integrated with visual features for video modelling. Moreover, we introduce a contextualized sequence transduction loss that incorporates the contextual information of gloss sequences in label prediction, without making any independence assumptions between the glosses. Our method sets a new state of the art on three widely used large-scale sign language recognition datasets: Phoenix-2014, Phoenix-2014-T, and CSL-Daily. On CSL-Daily, our approach achieves an absolute improvement of 4.9% in word error rate (WER) over the best published results.
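To make the independence assumption mentioned above concrete, the following is a minimal brute-force sketch of the standard CTC likelihood, not the C2ST loss proposed in the paper. The function names `collapse` and `ctc_likelihood` and the toy probabilities are illustrative assumptions. Each frame-level path is scored as a plain product of per-frame probabilities, so no term depends on which glosses were emitted earlier.

```python
from itertools import groupby, product

BLANK = 0  # index reserved for the CTC blank symbol (an assumption of this sketch)


def collapse(path):
    """Apply the CTC collapsing map: merge repeated symbols, then drop blanks."""
    merged = [symbol for symbol, _ in groupby(path)]
    return tuple(s for s in merged if s != BLANK)


def ctc_likelihood(frame_probs, target):
    """Brute-force P(target | frames) by enumerating every frame-level path.

    frame_probs[t][k] is the probability of symbol k at frame t.
    A path's probability is the product of its per-frame terms, which is
    exactly the conditional independence assumption made by CTC.
    """
    num_symbols = len(frame_probs[0])
    total = 0.0
    for path in product(range(num_symbols), repeat=len(frame_probs)):
        if collapse(path) == tuple(target):
            prob = 1.0
            for t, symbol in enumerate(path):
                prob *= frame_probs[t][symbol]
            total += prob
    return total
```

With two frames and uniform probabilities over {blank, gloss 1}, `ctc_likelihood([[0.5, 0.5], [0.5, 0.5]], [1])` evaluates to 0.75, since three of the four paths collapse to the single gloss. Exhaustive enumeration is exponential in the number of frames and is only meant for inspection; practical CTC implementations use dynamic programming instead.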

Related Material


[bibtex]
@InProceedings{Zhang_2023_ICCV,
    author    = {Zhang, Huaiwen and Guo, Zihang and Yang, Yang and Liu, Xin and Hu, De},
    title     = {C2ST: Cross-Modal Contextualized Sequence Transduction for Continuous Sign Language Recognition},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023},
    pages     = {21053-21062}
}