Lip Reading Sentences in the Wild

@article{Chung2017LipRS,
  title={Lip Reading Sentences in the Wild},
  author={Joon Son Chung and Andrew W. Senior and Oriol Vinyals and Andrew Zisserman},
  journal={2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2017},
  pages={3444-3453}
}
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio. Unlike previous works that have focussed on recognising a limited number of words or phrases, we tackle lip reading as an open-world problem – unconstrained natural language sentences, and in-the-wild videos. Our key contributions are: (1) a Watch, Listen, Attend and Spell (WLAS) network that learns to transcribe videos of mouth motion to characters, (2) a curriculum…
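The WLAS decoder follows the attend-and-spell pattern: at each output step a character decoder attends over the encoded video (and, when present, audio) states and emits the next character. Below is a minimal sketch of one such decoding step, assuming PyTorch and simple dot-product attention; the class name, module sizes, and vocabulary size are illustrative, not the paper's implementation.

    # Illustrative sketch of one "attend and spell" decoding step in the WLAS
    # spirit. All names and sizes are assumptions, not the paper's own code.
    import torch
    import torch.nn as nn

    class AttendAndSpell(nn.Module):
        def __init__(self, enc_dim=256, dec_dim=256, n_chars=40):
            super().__init__()
            self.embed = nn.Embedding(n_chars, dec_dim)
            self.rnn = nn.LSTMCell(dec_dim + enc_dim, dec_dim)
            self.score = nn.Linear(dec_dim, enc_dim)   # dot-product attention
            self.out = nn.Linear(dec_dim + enc_dim, n_chars)

        def forward(self, enc, prev_char, state):
            # enc: (B, T, enc_dim) encoder states; prev_char: (B,) char ids
            h, c = state
            e = torch.bmm(enc, self.score(h).unsqueeze(2)).squeeze(2)  # (B, T)
            alpha = torch.softmax(e, dim=1)                  # attention weights
            ctx = torch.bmm(alpha.unsqueeze(1), enc).squeeze(1)  # (B, enc_dim)
            h, c = self.rnn(torch.cat([self.embed(prev_char), ctx], 1), (h, c))
            logits = self.out(torch.cat([h, ctx], 1))        # next-char scores
            return logits, (h, c), alpha

During training the previous character would be teacher-forced; at test time the argmax (or a beam hypothesis) feeds back in.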
Deep Audio-Visual Speech Recognition
TLDR
This work compares two models for lip reading, one using a CTC loss and the other a sequence-to-sequence loss, both built on top of the Transformer self-attention architecture.
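The contrast between the two losses can be made concrete. In the sketch below (PyTorch assumed; shapes and vocabulary size are stand-ins), CTC scores per-frame character distributions with an extra blank symbol and marginalises over alignments, while the sequence-to-sequence loss reduces to cross-entropy over autoregressively produced decoder logits.

    # Hedged comparison of the two training losses; all tensors are stand-ins.
    import torch
    import torch.nn as nn

    T, B, C = 75, 4, 41                    # frames, batch, chars (0 = blank)
    log_probs = torch.randn(T, B, C).log_softmax(2)  # per-frame char scores
    targets = torch.randint(1, C, (B, 20))           # target char indices
    input_lengths = torch.full((B,), T, dtype=torch.long)
    target_lengths = torch.full((B,), 20, dtype=torch.long)

    # CTC: marginalises over all alignments of the targets to the T frames
    loss_ctc = nn.CTCLoss(blank=0)(log_probs, targets,
                                   input_lengths, target_lengths)

    # seq2seq: scores each next character given the previous ones, which
    # reduces to cross-entropy over decoder logits (random stand-ins here)
    dec_logits = torch.randn(B, 20, C)
    loss_s2s = nn.functional.cross_entropy(dec_logits.reshape(-1, C),
                                           targets.reshape(-1))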
Learning to lip read words by watching videos
Sub-word Level Lip Reading With Visual Attention
TLDR
This paper proposes an attention-based pooling mechanism to aggregate visual speech representations and proposes a model for Visual Speech Detection (VSD), trained on top of the lip reading network, significantly reducing the performance gap between lip reading and automatic speech recognition.
Lip Reading Sentences Using Deep Learning With Only Visual Cues
TLDR
A neural network-based lip reading system, designed to lip read sentences covering a wide range of vocabulary and to recognise words that may not be included in system training, achieves significantly improved performance with a 15% lower word error rate.
Experimenting with lipreading for large vocabulary continuous speech recognition
  • Karel Palecek · Journal on Multimodal User Interfaces · 2018
TLDR
It is shown that even for large vocabularies the visual signal contains enough information to improve word accuracy by up to 22% relative to acoustic-only recognition.
Large-Scale Visual Speech Recognition
TLDR
This work designed and trained an integrated lipreading system consisting of a video processing pipeline that maps raw video to stable videos of lips and sequences of phonemes, a scalable deep neural network that maps the lip videos to sequences of phoneme distributions, and a production-level speech decoder that outputs sequences of words.
A Lip Reading Model Using CNN with Batch Normalization
TLDR
A lip reading model for audio-less video data of variable-length frame sequences, using a twelve-layer Convolutional Neural Network with two batch-normalization layers to train the model and extract the visual features end-to-end.
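As a rough illustration of the kind of architecture described, the sketch below (PyTorch assumed) stacks convolutional blocks with batch normalization after selected layers; the depth, channel counts, and 500-way output are assumptions rather than the paper's exact configuration.

    # Illustrative convolutional stack with batch normalization; sizes are
    # assumptions, not the paper's exact twelve-layer network.
    import torch.nn as nn

    def conv_block(c_in, c_out, norm=False):
        layers = [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU()]
        if norm:                           # batch norm after selected convs
            layers.insert(1, nn.BatchNorm2d(c_out))
        return nn.Sequential(*layers)

    model = nn.Sequential(
        conv_block(1, 32, norm=True),      # first normalised block
        nn.MaxPool2d(2),
        conv_block(32, 64),
        nn.MaxPool2d(2),
        conv_block(64, 128, norm=True),    # second normalised block
        nn.AdaptiveAvgPool2d(1),
        nn.Flatten(),
        nn.Linear(128, 500),               # e.g. a word-class output layer
    )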
Learning Individual Speaking Styles for Accurate Lip to Speech Synthesis
TLDR
This work proposes a novel approach with key design choices to achieve, for the first time, accurate and natural lip-to-speech synthesis in such unconstrained scenarios, and shows that its method is four times more intelligible than previous works in this space.
Word Spotting in Silent Lip Videos
TLDR
A pipeline for recognition-free retrieval is developed, and a query expansion technique using pseudo-relevant feedback and a novel re-ranking method based on maximizing the correlation between spatio-temporal landmarks of the query and the top retrieval candidates are proposed.
Deep Learning for Lip Reading using Audio-Visual Information for Urdu Language
TLDR
This project trains two deep-learning models for lip reading: one for video sequences, using a spatiotemporal convolutional neural network, a bidirectional gated recurrent network, and the Connectionist Temporal Classification loss; and one for audio, which feeds MFCC features to a layer of LSTM cells and outputs the sequence.
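The audio branch described above is simple to sketch: MFCC frames feed a layer of LSTM cells whose per-step outputs are scored against the character set. The PyTorch snippet below is illustrative only; the feature dimension, hidden size, and vocabulary are assumptions.

    # Illustrative audio branch: MFCC frames -> LSTM -> per-step char scores.
    # All sizes are assumptions for the sketch.
    import torch
    import torch.nn as nn

    class AudioBranch(nn.Module):
        def __init__(self, n_mfcc=13, hidden=256, n_chars=41):
            super().__init__()
            self.lstm = nn.LSTM(n_mfcc, hidden, batch_first=True)
            self.out = nn.Linear(hidden, n_chars)

        def forward(self, mfcc):               # mfcc: (B, T, n_mfcc)
            h, _ = self.lstm(mfcc)
            return self.out(h).log_softmax(2)  # (B, T, n_chars), e.g. for CTC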

References

SHOWING 1-10 OF 52 REFERENCES
LipNet: Sentence-level Lipreading
TLDR
To the best of the authors' knowledge, LipNet is the first lipreading model to operate at sentence level, using a single end-to-end speaker-independent deep model to simultaneously learn spatiotemporal visual features and a sequence model.
LipNet: End-to-End Sentence-level Lipreading
TLDR
This work presents LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end.
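The distinctive ingredient is the spatiotemporal front-end: 3D convolutions over (time, height, width), then a recurrent layer over the time axis, producing per-frame character scores for the CTC loss. A minimal sketch in that spirit follows (PyTorch assumed); kernel sizes and widths are illustrative, not LipNet's exact configuration.

    # Sketch of a spatiotemporal front-end in the LipNet spirit; sizes are
    # illustrative, not the published configuration.
    import torch
    import torch.nn as nn

    class SpatiotemporalFrontend(nn.Module):
        def __init__(self, hidden=256, n_chars=28):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv3d(1, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
                nn.ReLU(),
                nn.MaxPool3d((1, 2, 2)),       # pool space only, keep time
            )
            self.gru = nn.GRU(32, hidden, batch_first=True, bidirectional=True)
            self.out = nn.Linear(2 * hidden, n_chars)

        def forward(self, frames):             # frames: (B, 1, T, H, W)
            f = self.conv(frames)              # (B, 32, T, H', W')
            f = f.mean(dim=(3, 4)).transpose(1, 2)   # (B, T, 32)
            h, _ = self.gru(f)
            return self.out(h).log_softmax(2)  # per-frame char scores for CTC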
Listen, Attend and Spell
TLDR
A neural network that learns to transcribe speech utterances to characters without making any independence assumptions between the characters, which is the key improvement of LAS over previous end-to-end CTC models.
Comparing visual features for lipreading
TLDR
It is found that shape alone is a useful cue for lipreading (which is consistent with human experiments); however, the incremental effect of shape over appearance appears not to be significant, which implies that the inner appearance of the mouth contains more information than the shape.
A review of recent advances in visual speech decoding
Lipreading with long short-term memory
Lipreading, i.e., speech recognition from visual-only recordings of a speaker's face, can be achieved with a processing pipeline based solely on neural networks, yielding significantly better accuracy…
Attention-Based Models for Speech Recognition
TLDR
The attention mechanism is extended with features needed for speech recognition, and a novel, generic method of adding location-awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rate.
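Location-awareness here means that the previous step's attention weights are filtered with a 1-D convolution and folded into the next step's attention scores, so the model knows where it attended last. A hedged sketch of such a scoring module (PyTorch assumed; all sizes illustrative):

    # Sketch of location-aware attention scoring: previous attention weights
    # are convolved and added into the new scores. Sizes are assumptions.
    import torch
    import torch.nn as nn

    class LocationAwareScore(nn.Module):
        def __init__(self, enc_dim=256, dec_dim=256, attn_dim=128, k=31):
            super().__init__()
            self.loc_conv = nn.Conv1d(1, attn_dim, kernel_size=k, padding=k // 2)
            self.w_enc = nn.Linear(enc_dim, attn_dim)
            self.w_dec = nn.Linear(dec_dim, attn_dim)
            self.v = nn.Linear(attn_dim, 1)

        def forward(self, enc, dec_h, prev_alpha):
            # enc: (B, T, enc_dim); dec_h: (B, dec_dim); prev_alpha: (B, T)
            loc = self.loc_conv(prev_alpha.unsqueeze(1)).transpose(1, 2)
            e = self.v(torch.tanh(self.w_enc(enc)
                                  + self.w_dec(dec_h).unsqueeze(1) + loc))
            return torch.softmax(e.squeeze(2), dim=1)   # new attention weights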
Lipreading using convolutional neural network
TLDR
The evaluation results of the isolated word recognition experiment demonstrate that the visual features acquired by the CNN significantly outperform those acquired by conventional dimensionality compression approaches, including principal component analysis.
Deep multimodal learning for Audio-Visual Speech Recognition
TLDR
An approach is studied in which uni-modal deep networks are trained separately and their final hidden layers are fused to obtain a joint feature space on which another deep network is built, demonstrating the tremendous value of the visual channel in phone classification even for audio with a high signal-to-noise ratio.
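The fusion scheme is easy to picture: the final hidden layers of separately trained audio and video networks are concatenated, and a further network is trained on the joint feature space. A minimal stand-in sketch (PyTorch assumed; feature sizes and the 40-way phone output are illustrative):

    # Illustrative late fusion of unimodal features; the features here are
    # random stand-ins for the final hidden layers of pretrained networks.
    import torch
    import torch.nn as nn

    audio_feat, video_feat = torch.randn(8, 256), torch.randn(8, 256)
    fusion_net = nn.Sequential(
        nn.Linear(256 + 256, 512), nn.ReLU(),
        nn.Linear(512, 40),        # e.g. phone classes
    )
    logits = fusion_net(torch.cat([audio_feat, video_feat], dim=1))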
Lip Reading in the Wild
TLDR
The aim is to recognise the words being spoken by a talking face, given only the video but not the audio, in unconstrained "in the wild" conditions.