Deep Audio-Visual Speech Recognition

Triantafyllos Afouras, Joon Son Chung, Andrew W. Senior, Oriol Vinyals, Andrew Zisserman. IEEE Transactions on Pattern Analysis and Machine Intelligence.

The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. The models that we train surpass the performance of all previous work on lip reading benchmark datasets by a significant margin.

Sub-word Level Lip Reading With Visual Attention

This paper proposes an attention-based pooling mechanism to aggregate visual speech representations, along with a Visual Speech Detection (VSD) model trained on top of the lip reading network, significantly reducing the performance gap between lip reading and automatic speech recognition.
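The attention-based pooling idea can be illustrated with a toy sketch: each frame's feature vector receives a scalar score, the scores are softmax-normalised, and the pooled representation is the score-weighted sum of the frames. This is a minimal illustration with fixed scoring weights, not the paper's learned module.

```python
import math

def attention_pool(frames, score_weights):
    """Aggregate a sequence of feature vectors into one vector.

    Each frame gets a scalar score (dot product with score_weights),
    scores are softmax-normalised into attention weights, and the
    pooled vector is the attention-weighted sum of the frames.
    """
    scores = [sum(w * x for w, x in zip(score_weights, f)) for f in frames]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    attn = [e / z for e in exps]  # attention weights, sum to 1
    dim = len(frames[0])
    return [sum(a * f[i] for a, f in zip(attn, frames)) for i in range(dim)]

# Toy example: three 2-D "visual speech" frame features
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled = attention_pool(frames, score_weights=[1.0, 1.0])
```

In the paper the scoring function is learned jointly with the lip reading network; here the fixed `score_weights` simply stand in for that learned projection.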

LIP-RTVE: An Audiovisual Database for Continuous Spanish in the Wild

A semi-automatically annotated audiovisual database for unconstrained natural Spanish, providing 13 hours of data extracted from Spanish television; baseline results are reported for both speaker-dependent and speaker-independent scenarios.

Seeing wake words: Audio-visual Keyword Spotting

A novel convolutional architecture, KWS-Net, that uses a similarity-map intermediate representation to separate the task into sequence matching and pattern detection, deciding whether and when a word of interest is spoken by a talking face, with or without the audio.
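The similarity-map idea can be sketched as follows: pairwise similarities between the keyword's unit embeddings (rows) and per-frame visual embeddings (columns) form a 2-D map, and a spoken keyword shows up as a band of high similarity that a detector can then locate. This toy version uses cosine similarity on hand-made vectors, not the paper's learned embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def similarity_map(keyword_embs, frame_embs):
    """Rows index keyword units (e.g. phonemes); columns index video frames."""
    return [[cosine(k, f) for f in frame_embs] for k in keyword_embs]

keyword = [[1.0, 0.0], [0.0, 1.0]]               # two keyword-unit embeddings
video = [[1.0, 0.1], [0.1, 1.0], [1.0, 1.0]]     # three frame embeddings
smap = similarity_map(keyword, video)
```

A matching keyword produces high values along a roughly diagonal path through the map, which is what the subsequent pattern-detection stage looks for.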

Large-vocabulary Audio-visual Speech Recognition in Noisy Environments

This paper proposes a new fusion strategy for stream integration within a hybrid recognizer: a recurrent integration network is trained to fuse the state posteriors of multiple single-modality models, guided by a set of model-based and signal-based stream reliability measures.
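A simplified, non-recurrent sketch of the idea: each stream contributes frame-level state posteriors, and per-stream reliability scores (e.g. SNR-based measures) are turned into fusion weights. In the paper a trained recurrent network performs the integration; here a softmax over the reliability scores stands in for it.

```python
import math

def fuse_posteriors(stream_posteriors, reliabilities):
    """Combine per-stream state posteriors into one fused posterior.

    stream_posteriors: one posterior vector per stream (same length).
    reliabilities: one scalar reliability score per stream; a softmax
    converts them into convex fusion weights.
    """
    m = max(reliabilities)
    exps = [math.exp(r - m) for r in reliabilities]
    z = sum(exps)
    weights = [e / z for e in exps]
    n_states = len(stream_posteriors[0])
    return [sum(w * p[i] for w, p in zip(weights, stream_posteriors))
            for i in range(n_states)]

audio_post = [0.7, 0.2, 0.1]   # audio stream, assumed noisy here
video_post = [0.2, 0.6, 0.2]   # visual stream
fused = fuse_posteriors([audio_post, video_post], reliabilities=[0.0, 2.0])
```

Because the fused posterior is a convex combination, it remains a valid distribution, and the more reliable stream dominates the decision.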

Visual speech recognition for multiple languages in the wild

This work proposes the addition of prediction-based auxiliary tasks to a VSR model, highlights the importance of hyperparameter optimization and appropriate data augmentation, and shows that such a model works across different languages and outperforms all previous methods trained on publicly available datasets by a large margin.

AVATAR: Unconstrained Audiovisual Speech Recognition

A new sequence-to-sequence AudioVisual ASR TrAnsformeR (AVATAR) is proposed which is trained end-to-end from spectrograms and full-frame RGB; it demonstrates the contribution of the visual modality on the How2 AV-ASR benchmark and outperforms all prior work by a large margin.

LiRA: Learning Visual Speech Representations from Audio through Self-supervision

This work trains a ResNet+Conformer model to predict acoustic features from unlabelled visual speech and finds that this pre-trained model can be leveraged towards word-level and sentence-level lip-reading through feature extraction and fine-tuning experiments.

Developing tools for audio-visual

This project focused on the lip reading aspect, a challenging task due to the many sources of variation found in real-world environments, and investigated how well a pre-trained lip reading model can generalise to a new dataset.

Lip-to-Speech Synthesis for Arbitrary Speakers in the Wild

This work proposes a new VAE-GAN architecture that learns to associate the lip and speech sequences amidst the variations of silent lip videos, and learns to synthesize speech sequences in any voice for the lip movements of any person.

How to Teach DNNs to Pay Attention to the Visual Modality in Speech Recognition

The inner workings of AV Align are investigated, and a regularisation method is proposed that predicts lip-related Action Units from visual representations, leading to better exploitation of the visual modality and encouraging researchers to rethink the multimodal convergence problem when one modality is dominant.

Lip Reading Sentences in the Wild

The WLAS model trained on the LRS dataset surpasses the performance of all previous work on standard lip reading benchmark datasets, often by a significant margin, and it is demonstrated that if audio is available, then visual information helps to improve speech recognition performance.

Large-Scale Visual Speech Recognition

This work designed and trained an integrated lipreading system, consisting of a video processing pipeline that maps raw video to stable videos of lips and sequences of phonemes, a scalable deep neural network that maps the lip videos to sequences of phoneme distributions, and a production-level speech decoder that outputs sequences of words.

Deep multimodal learning for Audio-Visual Speech Recognition

An approach is studied where uni-modal deep networks are trained separately and their final hidden layers are fused to obtain a joint feature space in which another deep network is built, demonstrating the tremendous value of the visual channel in phone classification even for audio with a high signal-to-noise ratio.

End-to-End Audiovisual Speech Recognition

This is the first audiovisual fusion model which simultaneously learns to extract features directly from the image pixels and audio waveforms and performs within-context word recognition on a large publicly available dataset (LRW).

LipNet: Sentence-level Lipreading

To the best of our knowledge, LipNet is the first lipreading model to operate at the sentence level, using a single end-to-end speaker-independent deep model to simultaneously learn spatiotemporal visual features and a sequence model.

Deep Lip Reading: a comparison of models and an online application

The best performing model improves the state-of-the-art word error rate on the challenging BBC-Oxford Lip Reading Sentences 2 (LRS2) benchmark dataset by over 20 percent.

State-of-the-Art Speech Recognition with Sequence-to-Sequence Models

A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced, which offers improvements over the commonly used single-head attention.
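The multi-head mechanism can be illustrated with a minimal sketch: the feature dimension is split into head-sized slices, scaled dot-product attention runs independently on each slice, and the per-head outputs are concatenated. This toy version omits the learned per-head linear projections that a real multi-head attention layer would apply.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    z = sum(es)
    return [e / z for e in es]

def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    w = softmax(scores)
    return [sum(wi * v[i] for wi, v in zip(w, values))
            for i in range(len(values[0]))]

def multi_head_attend(query, keys, values, n_heads):
    """Split the feature dimension into n_heads slices, attend per
    slice, and concatenate the results (learned projections omitted)."""
    d = len(query)
    assert d % n_heads == 0
    h = d // n_heads
    out = []
    for i in range(n_heads):
        sl = slice(i * h, (i + 1) * h)
        out.extend(attend(query[sl],
                          [k[sl] for k in keys],
                          [v[sl] for v in values]))
    return out
```

Each head can thus attend to a different part of the encoder output, which is the intuition behind the improvement over single-head attention.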

LipNet: End-to-End Sentence-level Lipreading

This work presents LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end.
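The connectionist temporal classification (CTC) loss lets LipNet train without frame-level alignments; at inference, its per-frame label distributions are collapsed by merging repeats and dropping blanks. A minimal greedy CTC decoder illustrating that collapse rule (not LipNet's actual beam-search decoder):

```python
def ctc_greedy_decode(frame_argmax, blank=0):
    """Collapse a per-frame best-label sequence CTC-style:
    merge consecutive repeated labels, then drop blanks."""
    out = []
    prev = None
    for label in frame_argmax:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# Frames with blank=0, 'a'=1, 'b'=2
decoded = ctc_greedy_decode([0, 1, 1, 0, 2, 2, 2, 0])
# decoded == [1, 2], i.e. "ab"
```

Note that a blank between two identical labels keeps them distinct (e.g. `[1, 0, 1]` decodes to `[1, 1]`), which is how CTC represents doubled letters.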

Deep complementary bottleneck features for visual speech recognition

Stavros Petridis, M. Pantic. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016.
This is the first work that extracts DBNFs for visual speech recognition directly from pixels based on deep autoencoders; the extracted complementary DBNFs in combination with DCT features achieve the best performance.