Deep Audio-Visual Speech Recognition

@article{Afouras2018DeepAS,
  title={Deep Audio-Visual Speech Recognition},
  author={Triantafyllos Afouras and Joon Son Chung and Andrew W. Senior and Oriol Vinyals and Andrew Zisserman},
  journal={IEEE Transactions on Pattern Analysis and Machine Intelligence},
  year={2018}
}
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Key result: the models that we train surpass the performance of all previous work on lip reading benchmark datasets by a significant margin.
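As a rough illustration of the recipe described here, the sketch below fuses per-frame audio and visual features, encodes them with a transformer, and trains with a CTC loss over characters. It is a minimal sketch only, not the paper's exact architecture: the fusion-by-concatenation choice, module names, dimensions and vocabulary size are all illustrative assumptions.

# Minimal audio-visual CTC sketch (illustrative, not the paper's exact model).
import torch
import torch.nn as nn

class AVTransformerCTC(nn.Module):
    def __init__(self, audio_dim=80, video_dim=512, d_model=256, vocab_size=40):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=2 * d_model, nhead=4,
                                           dim_feedforward=1024, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.head = nn.Linear(2 * d_model, vocab_size)       # characters + CTC blank

    def forward(self, audio_feats, video_feats):
        # audio_feats: (B, T, audio_dim); video_feats: (B, T, video_dim),
        # assumed already aligned to a common frame rate
        fused = torch.cat([self.audio_proj(audio_feats),
                           self.video_proj(video_feats)], dim=-1)
        return self.head(self.encoder(fused)).log_softmax(dim=-1)   # (B, T, vocab)

model = AVTransformerCTC()
log_probs = model(torch.randn(2, 100, 80), torch.randn(2, 100, 512))
targets = torch.randint(1, 40, (2, 20))                      # blank index 0 excluded
ctc = nn.CTCLoss(blank=0, zero_infinity=True)
loss = ctc(log_probs.transpose(0, 1), targets,               # CTC expects (T, B, vocab)
           input_lengths=torch.full((2,), 100),
           target_lengths=torch.full((2,), 20))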
Large-Scale Visual Speech Recognition
TLDR
This work designed and trained an integrated lipreading system, consisting of a video processing pipeline that maps raw video to stable videos of lips and sequences of phonemes, a scalable deep neural network that maps the lip videos to sequences of phoneme distributions, and a production-level speech decoder that outputs sequences of words.
Sub-word Level Lip Reading With Visual Attention
TLDR
This paper proposes an attention-based pooling mechanism to aggregate visual speech representations, and a model for Visual Speech Detection (VSD) trained on top of the lip reading network, significantly reducing the performance gap between lip reading and automatic speech recognition.
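A minimal sketch of attention-based temporal pooling of frame-level visual speech features is shown below; it is not the paper's exact module, and the scoring layer, dimensions and masking convention are assumptions.

# Attention pooling over time: one learned scalar score per frame, softmax-weighted sum.
import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.score = nn.Linear(dim, 1)                        # scalar score per frame

    def forward(self, frames, mask=None):
        # frames: (B, T, dim) visual speech representations; mask: True where padded
        logits = self.score(frames).squeeze(-1)               # (B, T)
        if mask is not None:
            logits = logits.masked_fill(mask, float('-inf'))
        weights = logits.softmax(dim=-1)                      # attention over time
        return (weights.unsqueeze(-1) * frames).sum(dim=1)    # (B, dim) pooled embedding

pooled = AttentionPool()(torch.randn(4, 75, 512))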
Seeing wake words: Audio-visual Keyword Spotting
TLDR
A novel convolutional architecture, KWS-Net, that uses a similarity map intermediate representation to separate the task into sequence matching and pattern detection, in order to decide whether and when a word of interest is spoken by a talking face, with or without the audio.
Recurrent Neural Network Transducer for Audio-Visual Speech Recognition
TLDR
This work presents a large-scale audio-visual speech recognition system based on a recurrent neural network transducer (RNN-T) architecture and significantly improves the state-of-the-art on the LRS3-TED set.
Large-vocabulary Audio-visual Speech Recognition in Noisy Environments
TLDR
This paper proposes a new fusion strategy: a recurrent integration network is trained to fuse the state posteriors of multiple single-modality models, guided by a set of model-based and signal-based stream reliability measures, used for stream integration within a hybrid recognizer.
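The sketch below illustrates the general idea of a recurrent integration network that fuses per-frame state posteriors from audio and video models, conditioned on stream reliability measures; the state count, reliability features and LSTM sizes are placeholders, not the paper's configuration.

# Posterior-fusion sketch: concatenate audio/video state posteriors with reliability
# features and let a recurrent network produce the fused posteriors for a hybrid decoder.
import torch
import torch.nn as nn

class PosteriorFusionRNN(nn.Module):
    def __init__(self, n_states=120, n_reliability=4, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(2 * n_states + n_reliability, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_states)

    def forward(self, audio_post, video_post, reliability):
        # audio_post, video_post: (B, T, n_states); reliability: (B, T, n_reliability)
        x = torch.cat([audio_post, video_post, reliability], dim=-1)
        h, _ = self.rnn(x)
        return self.out(h).log_softmax(dim=-1)                # fused state posteriors

fused = PosteriorFusionRNN()(torch.rand(2, 50, 120), torch.rand(2, 50, 120),
                             torch.rand(2, 50, 4))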
LiRA: Learning Visual Speech Representations from Audio through Self-supervision
TLDR
This work trains a ResNet+Conformer model to predict acoustic features from unlabelled visual speech and finds that this pre-trained model can be leveraged towards word-level and sentence-level lip-reading through feature extraction and fine-tuning experiments.
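A minimal sketch of this pretext task, assuming pre-extracted visual features and paired acoustic feature targets, is given below; the predictor head, feature dimensions and L1 objective are illustrative choices, not necessarily those used in LiRA.

# Self-supervised pretext sketch: regress acoustic features from visual speech features.
import torch
import torch.nn as nn

class VisualToAcoustic(nn.Module):
    def __init__(self, visual_dim=512, acoustic_dim=256):
        super().__init__()
        self.predictor = nn.Sequential(
            nn.Linear(visual_dim, 512), nn.ReLU(), nn.Linear(512, acoustic_dim))

    def forward(self, visual_feats):
        # visual_feats: (B, T, visual_dim) from a visual front-end (e.g. ResNet+Conformer)
        return self.predictor(visual_feats)

visual = torch.randn(2, 100, 512)            # unlabelled visual speech features
acoustic_target = torch.randn(2, 100, 256)   # acoustic features from the paired audio
loss = nn.L1Loss()(VisualToAcoustic()(visual), acoustic_target)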
AVATAR: Unconstrained Audiovisual Speech Recognition
TLDR
This work proposes a new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) which is trained end-to-end from spectrograms and full-frame RGB, and demonstrates the contribution of the visual modality on the How2 AV-ASR benchmark, especially in the presence of simulated noise.
Visual Speech Recognition for Multiple Languages in the Wild
TLDR
This work proposes the addition of prediction-based auxiliary tasks to a VSR model, highlights the importance of hyper-parameter optimisation and appropriate data augmentations, and shows that such a model works for different languages and outperforms all previous methods trained on publicly available datasets by a large margin.
How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition
TLDR
The inner workings of AV Align are investigated, and a regularisation method that predicts lip-related Action Units from visual representations is proposed, which leads to better exploitation of the visual modality and encourages researchers to rethink the multimodal convergence problem when one modality is dominant.
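The sketch below shows the general form of such a regulariser: an auxiliary head predicts Action Unit activations from the visual representations and its loss is added to the recognition objective. The number of AUs, the loss weighting and the placeholder recognition loss are assumptions, not the paper's settings.

# Auxiliary Action Unit (AU) prediction as a regulariser on the visual encoder.
import torch
import torch.nn as nn

au_head = nn.Linear(512, 8)                          # 8 lip-related AUs, illustrative
visual_repr = torch.randn(2, 100, 512)               # per-frame visual representations
au_labels = torch.randint(0, 2, (2, 100, 8)).float() # binary AU activations per frame
au_loss = nn.BCEWithLogitsLoss()(au_head(visual_repr), au_labels)
main_loss = torch.tensor(0.0)                        # placeholder for the recognition loss
total_loss = main_loss + 0.1 * au_loss               # 0.1 is an assumed weighting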
Transformer-Based Video Front-Ends for Audio-Visual Speech Recognition
TLDR
This work proposes to replace the 3D convolution with a video transformer as the video feature extractor, and achieves state-of-the-art audio-visual recognition performance on LRS3-TED after fine-tuning the model.
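A minimal sketch of a transformer video front-end that replaces a 3D convolutional stem with per-frame patch embeddings and self-attention is shown below; the patch size, pooling and dimensions are assumptions rather than the paper's design.

# Transformer video front-end sketch: patchify each frame, self-attend over patches,
# pool to one feature vector per frame for the downstream ASR encoder.
import torch
import torch.nn as nn

class VideoTransformerFrontEnd(nn.Module):
    def __init__(self, patch=16, d_model=256, n_layers=4):
        super().__init__()
        self.patch_embed = nn.Conv2d(1, d_model, kernel_size=patch, stride=patch)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, video):
        # video: (B, T, 1, H, W) greyscale mouth crops
        b, t, c, h, w = video.shape
        patches = self.patch_embed(video.reshape(b * t, c, h, w))   # (B*T, d, H/p, W/p)
        tokens = patches.flatten(2).transpose(1, 2)                 # (B*T, P, d)
        frame_feats = self.encoder(tokens).mean(dim=1)              # pool patches per frame
        return frame_feats.reshape(b, t, -1)                        # (B, T, d)

feats = VideoTransformerFrontEnd()(torch.randn(2, 16, 1, 96, 96))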
...
...

References

Showing 1-10 of 61 references
Lip Reading Sentences in the Wild
TLDR
The WLAS model trained on the LRS dataset surpasses the performance of all previous work on standard lip reading benchmark datasets, often by a significant margin, and it is demonstrated that if audio is available, then visual information helps to improve speech recognition performance.
Large-Scale Visual Speech Recognition
TLDR
This work designed and trained an integrated lipreading system, consisting of a video processing pipeline that maps raw video to stable videos of lips and sequences of phonemes, a scalable deep neural network that maps the lip videos to sequences of phoneme distributions, and a production-level speech decoder that outputs sequences of words.
Deep multimodal learning for Audio-Visual Speech Recognition
TLDR
An approach is studied where uni-modal deep networks are trained separately and their final hidden layers are fused to obtain a joint feature space, on which another deep network is built, demonstrating the tremendous value of the visual channel in phone classification even for audio with a high signal-to-noise ratio.
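The sketch below illustrates this late-fusion idea under assumed feature dimensions: two unimodal networks produce final hidden representations, which are concatenated and fed to a joint network for phone classification.

# Late-fusion sketch: concatenate the final hidden layers of separately trained
# unimodal networks and train a joint network on top (dimensions are illustrative).
import torch
import torch.nn as nn

audio_net = nn.Sequential(nn.Linear(40, 512), nn.ReLU(), nn.Linear(512, 256), nn.ReLU())
video_net = nn.Sequential(nn.Linear(1024, 512), nn.ReLU(), nn.Linear(512, 256), nn.ReLU())
joint_net = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 40))  # 40 phone classes

audio_hidden = audio_net(torch.randn(8, 40))     # final hidden layer of the audio model
video_hidden = video_net(torch.randn(8, 1024))   # final hidden layer of the video model
phone_logits = joint_net(torch.cat([audio_hidden, video_hidden], dim=-1))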
End-to-End Audiovisual Speech Recognition
TLDR
This is the first audiovisual fusion model which simultaneously learns to extract features directly from the image pixels and audio waveforms and performs within-context word recognition on a large publicly available dataset (LRW).
LipNet: Sentence-level Lipreading
TLDR
To the best of the authors' knowledge, LipNet is the first lipreading model to operate at sentence level, using a single end-to-end speaker-independent deep model to simultaneously learn spatiotemporal visual features and a sequence model.
State-of-the-Art Speech Recognition with Sequence-to-Sequence Models
  • C. Chiu, T. Sainath, M. Bacchiani
  • 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
TLDR
A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced, which offers improvements over the commonly-used single-head attention.
A review of recent advances in visual speech decoding
Deep Lip Reading: a comparison of models and an online application
TLDR
The best performing model improves the state-of-the-art word error rate on the challenging BBC-Oxford Lip Reading Sentences 2 (LRS2) benchmark dataset by over 20 percent.
LipNet: End-to-End Sentence-level Lipreading
TLDR
This work presents LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end.
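The sketch below captures the overall LipNet recipe (3D convolutions, a bidirectional GRU over time, and per-frame character distributions for CTC training); layer counts and sizes are placeholders rather than the published configuration.

# LipNet-style sketch: spatiotemporal convolutions + recurrent network + CTC outputs.
import torch
import torch.nn as nn

class LipNetSketch(nn.Module):
    def __init__(self, vocab_size=28):                       # characters + CTC blank
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)), nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)))
        self.gru = nn.GRU(32 * 24 * 24, 256, batch_first=True, bidirectional=True)
        self.out = nn.Linear(512, vocab_size)

    def forward(self, video):
        # video: (B, 1, T, H, W) greyscale mouth crops
        x = self.conv(video)                                  # (B, 32, T, H/2, W/2)
        b, c, t, h, w = x.shape
        x = x.permute(0, 2, 1, 3, 4).reshape(b, t, c * h * w)
        h_seq, _ = self.gru(x)
        return self.out(h_seq).log_softmax(dim=-1)            # per-frame char distributions

log_probs = LipNetSketch()(torch.randn(2, 1, 75, 48, 48))     # (2, 75, 28), fed to nn.CTCLoss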
Listen, Attend and Spell
TLDR
A neural network that learns to transcribe speech utterances to characters without making any independence assumptions between the characters, which is the key improvement of LAS over previous end-to-end CTC models.
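A compact sketch of the listen-attend-spell pattern is shown below: a recurrent encoder ("listener") over acoustic frames and an attention-based character decoder ("speller") with teacher forcing; all sizes and the dot-product attention are simplifying assumptions, not the original model.

# Listen-attend-spell sketch: BLSTM listener + attentive character speller (teacher forcing).
import torch
import torch.nn as nn

class LASSketch(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, vocab=30):
        super().__init__()
        self.listener = nn.LSTM(feat_dim, hidden, num_layers=2,
                                batch_first=True, bidirectional=True)
        self.embed = nn.Embedding(vocab, hidden)
        self.speller = nn.LSTMCell(hidden + 2 * hidden, hidden)
        self.query = nn.Linear(hidden, 2 * hidden)            # decoder state -> attention query
        self.out = nn.Linear(hidden + 2 * hidden, vocab)

    def forward(self, feats, prev_chars):
        # feats: (B, T, feat_dim); prev_chars: (B, U) ground-truth prefix (teacher forcing)
        enc, _ = self.listener(feats)                         # (B, T, 2*hidden)
        b = feats.size(0)
        h = c = feats.new_zeros(b, self.speller.hidden_size)
        logits = []
        for u in range(prev_chars.size(1)):
            scores = torch.bmm(enc, self.query(h).unsqueeze(-1)).squeeze(-1)   # (B, T)
            attn = scores.softmax(dim=-1)
            context = torch.bmm(attn.unsqueeze(1), enc).squeeze(1)             # (B, 2*hidden)
            h, c = self.speller(torch.cat([self.embed(prev_chars[:, u]), context], -1), (h, c))
            logits.append(self.out(torch.cat([h, context], -1)))
        return torch.stack(logits, dim=1)                     # (B, U, vocab)

logits = LASSketch()(torch.randn(2, 100, 80), torch.randint(0, 30, (2, 12)))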
...
...