Watch, read and lookup: learning to spot signs from multiple supervisors

Liliane Momeni, Gul Varol, Samuel Albanie, Triantafyllos Afouras, Andrew Zisserman; Proceedings of the Asian Conference on Computer Vision (ACCV), 2020


The focus of this work is sign spotting: given a video of an isolated sign, our task is to identify whether and where it has been signed in a continuous, co-articulated sign language video. To achieve this, we train a model using multiple types of available supervision by: (1) watching existing sparsely labelled footage; (2) reading associated subtitles (readily available translations of the signed content), which provide additional weak supervision; and (3) looking up words (for which no co-articulated labelled examples are available) in visual sign language dictionaries to enable novel sign spotting. These three tasks are integrated into a unified learning framework using the principles of Noise Contrastive Estimation and Multiple Instance Learning. We validate the effectiveness of our approach on low-shot sign spotting benchmarks. In addition, we contribute a machine-readable British Sign Language (BSL) dictionary dataset of isolated signs, BSLDict, to facilitate study of this task. The dataset, models and code are available at our project page.
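To illustrate how Noise Contrastive Estimation and Multiple Instance Learning can be combined, the sketch below computes a generic MIL-NCE style objective: each query embedding has a bag of candidate positives, and the loss rewards the bag as a whole rather than any single candidate. This is not the paper's exact loss; the function name `mil_nce_loss`, the similarity-matrix input, and the temperature value are assumptions made for illustration only.

```python
import numpy as np


def mil_nce_loss(sim, pos_mask, temperature=0.07):
    """Generic MIL-NCE style loss (illustrative sketch, not the paper's code).

    sim:      (N, M) array of similarities between N queries and M candidates.
    pos_mask: (N, M) boolean array, True where candidate j belongs to the
              positive bag of query i; all other candidates act as negatives.
    Returns the mean over queries of
        -log( sum_{j in bag} exp(s_ij / t) / sum_{j} exp(s_ij / t) ).
    """
    sim = np.asarray(sim, dtype=np.float64)
    pos_mask = np.asarray(pos_mask, dtype=bool)
    if sim.shape != pos_mask.shape:
        raise ValueError("sim and pos_mask must have the same shape")
    if not pos_mask.any(axis=1).all():
        raise ValueError("every query needs at least one positive candidate")

    logits = sim / temperature
    # Subtract the row-wise maximum for numerical stability; the ratio
    # between the two sums below is unchanged by this shift.
    logits = logits - logits.max(axis=1, keepdims=True)
    exp_logits = np.exp(logits)

    pos_sum = (exp_logits * pos_mask).sum(axis=1)  # mass on the positive bag
    all_sum = exp_logits.sum(axis=1)               # mass on all candidates
    return float(np.mean(-np.log(pos_sum / all_sum)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical toy data: 3 queries, 5 candidates, 2 positives per query.
    sim = rng.normal(size=(3, 5))
    mask = np.zeros((3, 5), dtype=bool)
    mask[:, :2] = True
    print(mil_nce_loss(sim, mask))
```

Summing over the bag inside the logarithm is what makes the objective tolerant of weak labels: only one candidate in the bag needs to match well for the loss to be low, which suits supervision such as subtitles, where the exact temporal location of a sign is unknown.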

Related Material

@InProceedings{Momeni_2020_ACCV,
    author    = {Momeni, Liliane and Varol, Gul and Albanie, Samuel and Afouras, Triantafyllos and Zisserman, Andrew},
    title     = {Watch, read and lookup: learning to spot signs from multiple supervisors},
    booktitle = {Proceedings of the Asian Conference on Computer Vision (ACCV)},
    month     = {November},
    year      = {2020}
}