Weakly-supervised Fingerspelling Recognition in British Sign Language Videos

@inproceedings{prajwal_fingerspelling_bsl,
  title={Weakly-supervised Fingerspelling Recognition in British Sign Language Videos},
  author={Prajwal K R and Hannah Bull and Liliane Momeni and Samuel Albanie and G{\"u}l Varol and Andrew Zisserman},
  booktitle={British Machine Vision Conference},
}
The goal of this work is to detect and recognize sequences of letters signed using fingerspelling in British Sign Language (BSL). Previous fingerspelling recognition methods have not focused on BSL, whose signing alphabet differs substantially (e.g., two-handed rather than one-handed) from that of American Sign Language (ASL), and they rely on manual annotations for training. In contrast to previous methods, our method uses only weak annotations from subtitles for training. We localize potential instances of…
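As an illustration of how subtitles can provide weak supervision, here is a toy heuristic for proposing words that are likely to be fingerspelled. This is a hypothetical sketch, not the paper's actual pipeline: it relies only on the general observation that fingerspelling is often used for names and rare words, so non-sentence-initial capitalised words form a crude candidate set. The `candidate_words` function is an assumption introduced for this example.

```python
# Toy heuristic (illustrative only, NOT the paper's method): propose
# fingerspelling candidates from a subtitle by keeping capitalised words
# that are not sentence-initial, since names are commonly fingerspelled.

def candidate_words(subtitle: str) -> list[str]:
    """Return likely fingerspelled words from one subtitle line."""
    words = subtitle.split()
    return [w.strip(".,!?") for i, w in enumerate(words)
            if i > 0 and w[:1].isupper()]

print(candidate_words("The chef Mary Berry visited Cardiff today."))
# ['Mary', 'Berry', 'Cardiff']
```

A real system would pair such candidates with temporal localization in the video; this sketch only shows the text side of the weak signal.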

American Sign Language Fingerspelling Recognition in the Wild

This work introduces the largest dataset so far for fingerspelling recognition, the first to use naturally occurring video data, and presents the first attempt to recognize fingerspelling sequences in this challenging setting.

Lexicon-free fingerspelling recognition from video: Data, models, and signer adaptation

Searching for fingerspelled content in American Sign Language

This paper proposes an end-to-end model for this task, FSS-Net, that jointly detects fingerspelling and matches it to a text sequence; it significantly outperforms baseline methods adapted from prior work on related tasks.

Fingerspelling Recognition in the Wild With Iterative Visual Attention

This work proposes an end-to-end model based on an iterative attention mechanism, without explicit hand detection or segmentation, that outperforms prior work by a large margin on recognition of fingerspelling sequences in ASL videos collected in the wild.

Multitask training with unlabeled data for end-to-end sign language fingerspelling recognition

  • Bowen Shi, Karen Livescu
  • 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
  • 2017
A model for fingerspelling recognition that consists of an autoencoder-based feature extractor and an attention-based neural encoder-decoder, trained jointly, achieves 11.6% and 4.4% absolute letter-accuracy improvements over previous approaches that required frame-level training labels.

BSL-1K: Scaling up co-articulated sign language recognition using mouthing cues

A new scalable approach to data collection for sign recognition in continuous videos is introduced, and it is shown that BSL-1K can be used to train strong sign recognition models for co-articulated signs in BSL and that these models additionally form excellent pretraining for other sign languages and benchmarks.

Sign Language Fingerspelling Recognition using Synthetic Data

This model is based on a pretrained convolutional network, fine-tuned using synthetic images, and tested on a corpus of real recordings of native signers, achieving a recognition accuracy of 71%.

Learning sign language by watching TV (using weakly aligned subtitles)

This work proposes a distance function for matching signing sequences that incorporates the trajectory of both hands, hand shape and orientation, and properly models the case of hands touching. By optimizing a scoring function based on multiple instance learning, it shows that the sign of interest can be extracted from hours of signing footage, despite the very weak and noisy supervision.
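The multiple-instance-learning idea above can be sketched in a few lines. This is a generic MIL selection step under standard assumptions, not the paper's scoring function: each "bag" is a set of candidate temporal windows from a video whose subtitle (weak label) says the sign occurs, and MIL assumes at least one window in a positive bag is a true instance, so the max-scoring window is selected. The `mil_select` helper is hypothetical.

```python
# Generic multiple-instance-learning selection (illustrative sketch):
# each bag holds per-window scores for one video; under the MIL
# assumption, the best-scoring window in a positive bag is taken as
# the putative true instance of the sign.

def mil_select(bags):
    """Return the index of the max-scoring window in each bag."""
    return [max(range(len(bag)), key=lambda i: bag[i]) for bag in bags]

bags = [
    [0.1, 0.7, 0.3],   # video 1: window 1 scores highest
    [0.2, 0.1, 0.9],   # video 2: window 2 scores highest
]
print(mil_select(bags))  # [1, 2]
```

In a full system the window scores would themselves be learned, with the selected instances feeding back into training.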

Automatic recognition of fingerspelled words in British Sign Language

  • Stephan Liwicki, M. Everingham
  • 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops
  • 2009
This work investigates the problem of recognizing words from video, fingerspelled using the British Sign Language (BSL) fingerspelling alphabet, and achieves a word recognition accuracy of 98.9% on a dataset of 1,000 low quality webcam videos of 100 words.

Aligning Subtitles in Sign Language Videos

This work proposes a Transformer architecture tailored to this task, trained on manually annotated alignments covering over 15K subtitles spanning 17.7 hours of video, and opens up possibilities for advancing machine translation of sign languages by providing continuously synchronized video-text data.