Cross-View Isolated Sign Language Recognition via View Synthesis and Feature Disentanglement
Xin Shen, Xinyu Wang, Lei Shen, Kaihao Zhang, Xin Yu
Abstract
Cross-view isolated sign language recognition (CV-ISLR) addresses the challenge of identifying isolated signs from viewpoints unseen during training, a problem aggravated by the scarcity of multi-view data in existing benchmarks. To bridge this gap, we introduce a novel two-stage framework comprising View Synthesis and Contrastive Multi-task View-Semantics Recognition. In the View Synthesis stage, we simulate unseen viewpoints by extracting 3D keypoints from the front-view training data and synthesizing common-view 2D skeleton sequences via virtual camera rotation, which enriches view diversity without the cost of multi-camera setups. However, training directly on these synthetic samples yields limited improvement, as viewpoint-specific and semantics-specific features remain entangled. To overcome this drawback, we present a Contrastive Multi-task View-Semantics Recognition (CMVSR) module that disentangles viewpoint-dependent features from sign semantics. In this way, CMVSR obtains view-invariant representations of sign videos, leading to robust recognition performance across diverse camera viewpoints. We evaluate our approach on the MM-WLAuslan dataset, the first benchmark for CV-ISLR, and on our extended protocol MTV-Test, which includes additional multi-view data captured in the wild. Experimental results demonstrate that our method not only improves the accuracy of front-view skeleton-based isolated sign language recognition but also exhibits superior generalization to novel viewpoints.
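The view-synthesis step described above lends itself to a short sketch: given root-centered 3D keypoints, rotate a virtual camera around the subject and project the rotated skeleton back to 2D. The sketch below is a minimal illustration under assumed conventions (a yaw-only rotation, a pinhole camera with made-up intrinsics, and a (T, J, 3) keypoint array); it is not the authors' implementation, and names such as synthesize_view are hypothetical.

# Minimal sketch of virtual-camera view synthesis: rotate 3D keypoints
# about the vertical axis, then project them to 2D skeleton sequences.
# Shapes, the yaw-only rotation, and the pinhole intrinsics are
# illustrative assumptions, not the paper's implementation.
import numpy as np

def rotate_y(points, angle_rad):
    """Rotate 3D keypoints (..., 3) about the vertical (y) axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    R = np.array([[ c,  0.0,  s],
                  [0.0, 1.0, 0.0],
                  [-s,  0.0,  c]])
    return points @ R.T

def project_pinhole(points, focal=1000.0, center=(512.0, 512.0)):
    """Project 3D keypoints (..., 3) to 2D with a simple pinhole camera."""
    x, y, z = points[..., 0], points[..., 1], points[..., 2]
    z = np.clip(z, 1e-6, None)       # guard against division by zero
    u = focal * x / z + center[0]
    v = focal * y / z + center[1]
    return np.stack([u, v], axis=-1)

def synthesize_view(keypoints_3d, yaw_deg, depth_offset=3.0):
    """Synthesize a 2D skeleton sequence for one virtual viewpoint.

    keypoints_3d: (T, J, 3) root-centered 3D keypoints, T frames, J joints.
    yaw_deg: virtual camera rotation around the subject, in degrees.
    """
    rotated = rotate_y(keypoints_3d, np.deg2rad(yaw_deg))
    rotated[..., 2] += depth_offset  # place the skeleton in front of the camera
    return project_pinhole(rotated)  # (T, J, 2) 2D skeleton sequence

# Example: enrich one front-view sample with several common viewpoints.
kp3d = np.random.randn(64, 133, 3) * 0.3  # stand-in for extracted 3D keypoints
views = {yaw: synthesize_view(kp3d, yaw) for yaw in (-60, -30, 0, 30, 60)}

Training on such rotated projections is one plausible way to enrich view diversity from front-view data alone; the abstract notes that this on its own gives limited gains until viewpoint and semantics features are disentangled.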
Related Material
[pdf]
[supp]
[bibtex]
@InProceedings{Shen_2025_ICCV,
  author    = {Shen, Xin and Wang, Xinyu and Shen, Lei and Zhang, Kaihao and Yu, Xin},
  title     = {Cross-View Isolated Sign Language Recognition via View Synthesis and Feature Disentanglement},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  month     = {October},
  year      = {2025},
  pages     = {20647-20657}
}