Corpus ID: 7421075

LipNet: End-to-End Sentence-level Lipreading

@article{Assael2016LipNetES,
  title={LipNet: End-to-End Sentence-level Lipreading},
  author={Yannis Assael and Brendan Shillingford and Shimon Whiteson and Nando de Freitas},
  journal={arXiv preprint arXiv:1611.01599},
  year={2016}
}
Lipreading is the task of decoding text from the movement of a speaker's mouth. […] Motivated by this observation, we present LipNet, a model that maps a variable-length sequence of video frames to text, making use of spatiotemporal convolutions, a recurrent network, and the connectionist temporal classification loss, trained entirely end-to-end. To the best of our knowledge, LipNet is the first end-to-end sentence-level lipreading model that simultaneously learns spatiotemporal visual features and…
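The pipeline named in the abstract (spatiotemporal convolutions feeding a recurrent network, trained end-to-end with the CTC loss) can be illustrated with a minimal PyTorch sketch. The layer sizes, kernel shapes, and 75-frame input below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a LipNet-style pipeline (hypothetical hyperparameters):
# spatiotemporal (3D) convolutions -> bidirectional GRU -> per-frame character
# logits trained with the CTC loss.
import torch
import torch.nn as nn

class LipNetSketch(nn.Module):
    def __init__(self, vocab_size=28):  # e.g. 26 letters + space + CTC blank
        super().__init__()
        self.frontend = nn.Sequential(
            # input: (batch, channels=3, time, height, width)
            nn.Conv3d(3, 32, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),  # pool space, keep time
            nn.Conv3d(32, 64, kernel_size=(3, 5, 5), padding=(1, 2, 2)),
            nn.ReLU(),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
        )
        self.gru = nn.GRU(input_size=64, hidden_size=128,
                          num_layers=2, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 128, vocab_size)

    def forward(self, video):                   # video: (B, 3, T, H, W)
        feats = self.frontend(video)            # (B, C, T, H', W')
        feats = feats.mean(dim=(3, 4))          # global spatial pooling -> (B, C, T)
        feats = feats.transpose(1, 2)           # (B, T, C)
        out, _ = self.gru(feats)                # (B, T, 2*hidden)
        return self.fc(out)                     # per-frame logits (B, T, vocab)

model = LipNetSketch()
logits = model(torch.randn(2, 3, 75, 64, 128))       # two 75-frame clips
log_probs = logits.log_softmax(-1).transpose(0, 1)   # (T, B, vocab) for CTC
targets = torch.randint(1, 28, (2, 30))              # dummy character labels
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           torch.full((2,), 75), torch.full((2,), 30))
```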

Citations of this paper

Efficient End-to-End Sentence-Level Lipreading with Temporal Convolutional Networks
TLDR
An efficient end-to-end sentence-level lipreading model is proposed, using an encoder built from a 3D convolutional network, ResNet50, and a Temporal Convolutional Network (TCN), with a CTC objective function as the decoder; the TCN can partly avoid the vanishing-gradient and performance limitations of RNNs (LSTM, GRU).
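As a rough illustration of why a TCN can stand in for a recurrent decoder here, the sketch below stacks dilated 1-D convolutions with residual connections, so gradients flow through sums rather than long recurrences. The channel count and dilation schedule are assumptions, not this paper's settings.

```python
# Hypothetical TCN residual block: dilated 1-D convolutions over time replace
# the recurrence of an LSTM/GRU, sidestepping vanishing gradients across steps.
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    def __init__(self, channels=256, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) // 2 * dilation  # keep sequence length fixed
        self.net = nn.Sequential(
            nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation),
            nn.ReLU(),
            nn.Conv1d(channels, channels, kernel_size, padding=pad, dilation=dilation),
            nn.ReLU(),
        )

    def forward(self, x):          # x: (batch, channels, time)
        return x + self.net(x)     # residual connection

# Stack blocks with growing dilation so the receptive field covers whole words.
tcn = nn.Sequential(*[TCNBlock(dilation=d) for d in (1, 2, 4, 8)])
frames = torch.randn(2, 256, 75)   # e.g. per-frame visual features
out = tcn(frames)                  # (2, 256, 75), ready for a CTC classifier
```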
LCSNet: End-to-End Lipreading with Channel-aware Feature Selection
TLDR
An end-to-end deep neural network-based lipreading model (LCSNet) is proposed that extracts short-term spatio-temporal features and motion-trajectory features from the lip region in video clips, achieving improvements of 52% and 47% over LipNet.
Lip-Corrector: Application of BERT-based model in sentence-level lipreading
TLDR
Experiments on the two largest sentence-level lipreading datasets, LRS2 and LRS3, show that the model surpasses all baselines, indicating that lipreading methods combined with NLP techniques achieve better results.
On the Importance of Video Action Recognition for Visual Lipreading
TLDR
The spatio-temporal modeling capacity of I3D (Inflated 3D ConvNet) for visual lipreading is investigated, and it is demonstrated that, after pre-training on a large-scale video action recognition dataset (e.g., Kinetics), the models show a considerable improvement in performance on the lipreading task.
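The inflation trick behind I3D (bootstrapping 3D kernels from pretrained 2D ones by tiling them over time and rescaling) can be sketched as follows. This is a hedged illustration of the idea, not the I3D codebase; the kernel sizes are assumed.

```python
# Sketch of I3D-style kernel inflation: build a 3D conv from 2D pretrained
# weights by repeating them over time and dividing by T, so a temporally
# constant video initially reproduces the 2D network's activations.
import torch
import torch.nn as nn

def inflate_conv2d(conv2d: nn.Conv2d, time_kernel: int = 3) -> nn.Conv3d:
    conv3d = nn.Conv3d(
        conv2d.in_channels, conv2d.out_channels,
        kernel_size=(time_kernel, *conv2d.kernel_size),
        padding=(time_kernel // 2, *conv2d.padding),
        bias=conv2d.bias is not None,
    )
    # (out, in, kH, kW) -> (out, in, T, kH, kW), rescaled by 1/T
    w = conv2d.weight.data.unsqueeze(2).repeat(1, 1, time_kernel, 1, 1)
    conv3d.weight.data.copy_(w / time_kernel)
    if conv2d.bias is not None:
        conv3d.bias.data.copy_(conv2d.bias.data)
    return conv3d

conv2d = nn.Conv2d(3, 64, kernel_size=7, padding=3)   # e.g. a pretrained stem
conv3d = inflate_conv2d(conv2d)                       # reuse it for video
video = torch.randn(1, 3, 16, 112, 112)               # (B, C, T, H, W)
print(conv3d(video).shape)                            # torch.Size([1, 64, 16, 112, 112])
```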
Understanding Pictograph with Facial Features: End-to-End Sentence-Level Lip Reading of Chinese
TLDR
This paper implements visual-only Chinese lip reading of unconstrained sentences in a two-step end-to-end architecture (LipCH-Net), in which two deep neural network models perform Picture-to-Pinyin and Pinyin-to-Hanzi recognition respectively, followed by joint optimization to improve overall performance.
End-to-End Multi-View Lipreading
TLDR
This work presents an end-to-end multi-view lipreading system based on Bidirectional Long Short-Term Memory (BLSTM) networks; it is believed to be the first model that simultaneously learns to extract features directly from pixels and performs visual speech classification from multiple views, while also achieving state-of-the-art performance.
Lipreading with DenseNet and resBi-LSTM
TLDR
A simple method is introduced to build a dataset for sentence-level Mandarin lipreading from programs such as news, speeches, and talk shows, and a model is proposed that is composed of a 3D convolutional layer with DenseNet and a residual bidirectional long short-term memory network.
Multi-modal Model for Sentence Level Captioning
TLDR
This work develops a multi-modal captioning model that integrates both auditory and visual cues, making use of a DeepSpeech ASR model and LipNet, an end-to-end sentence-level lipreading model, fused together with fully connected softmax and Connectionist Temporal Classification layers.
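One plausible reading of this fusion, sketched below under assumed feature dimensions, is a late fusion in which per-timestep acoustic and visual features are concatenated and projected through a shared linear layer to character logits trained with CTC; the paper's exact fusion may differ.

```python
# Hedged sketch of audio-visual late fusion for CTC captioning: per-step
# acoustic and visual features are concatenated, projected to character
# logits, and trained with a single CTC loss. All dimensions are assumed.
import torch
import torch.nn as nn

audio_dim, visual_dim, vocab = 512, 256, 28
fusion = nn.Linear(audio_dim + visual_dim, vocab)

audio_feats = torch.randn(75, 2, audio_dim)    # (T, B, D), e.g. from an ASR encoder
visual_feats = torch.randn(75, 2, visual_dim)  # (T, B, D), e.g. from a lipreading net

logits = fusion(torch.cat([audio_feats, visual_feats], dim=-1))
log_probs = logits.log_softmax(-1)             # (T, B, vocab)
targets = torch.randint(1, vocab, (2, 30))     # dummy character labels
loss = nn.CTCLoss(blank=0)(log_probs, targets,
                           torch.full((2,), 75), torch.full((2,), 30))
```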
A Survey of Lipreading Methods Based on Deep Learning
TLDR
The convolutional neural network structures used in the front-end and the sequence-processing models used in the back-end are discussed and analyzed, current lipreading datasets are introduced, and the methods evaluated on these datasets are compared.
Multi-Grained Spatio-temporal Modeling for Lip-reading
TLDR
A novel lip-reading model is proposed that captures not only the nuances between words but also the styles of different speakers, via multi-grained spatio-temporal modeling of the speaking process.

References

SHOWING 1-10 OF 64 REFERENCES
Lipreading with long short-term memory
Lipreading, i.e., speech recognition from visual-only recordings of a speaker's face, can be achieved with a processing pipeline based solely on neural networks, yielding significantly better accuracy…
Improved speaker independent lip reading using speaker adaptive training and deep neural networks
TLDR
It is shown that error rates for speaker-independent lip-reading can be very significantly reduced, and that there is no need to map phonemes to visemes for context-dependent visual speech transcription.
Lipreading using convolutional neural network
TLDR
The evaluation results of the isolated word recognition experiment demonstrate that the visual features acquired by the CNN significantly outperform those acquired by conventional dimensionality compression approaches, including principal component analysis.
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
TLDR
It is shown that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages, and is competitive with the transcription of human workers when benchmarked on standard datasets.
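Both Deep Speech 2 and LipNet emit per-frame distributions that are decoded by collapsing CTC paths. A minimal greedy decoder (merge repeats, then drop blanks) is sketched below for illustration; the papers themselves use beam search, typically with a language model.

```python
# Greedy CTC decoding: take the argmax label per frame, merge consecutive
# duplicates, then remove the blank symbol.
import torch

def ctc_greedy_decode(log_probs: torch.Tensor, blank: int = 0) -> list[int]:
    # log_probs: (T, vocab) for a single utterance
    best = log_probs.argmax(dim=-1).tolist()
    out, prev = [], None
    for label in best:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

# e.g. a frame-level path [blank, a, a, blank, b] decodes to [a, b]
```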
Integration of deep bottleneck features for audio-visual speech recognition
TLDR
This paper proposes a method of integrating DBNFs using multi-stream HMMs in order to improve the performance of AVSR under both clean and noisy conditions, and evaluates the method on a continuously spoken Japanese digit recognition task under matched and mismatched conditions.
Extraction of Visual Features for Lipreading
TLDR
Three methods for parameterizing lip image sequences for recognition using hidden Markov models are compared; two are top-down approaches that fit a model of the inner and outer lip contours and derive lipreading features from a principal component analysis of shape, or of shape and appearance, respectively.
Patch-Based Representation of Visual Speech
TLDR
It is shown that, by dealing with the mouth region in this patch-based manner and extracting more speech information from the visual domain, the relative word error rate on the CUAVE audio-visual corpus is improved.
Lipreading With Local Spatiotemporal Descriptors
TLDR
Local spatiotemporal descriptors are presented to represent and recognize spoken isolated phrases based solely on visual input; the descriptors incorporate local processing and are robust to monotonic gray-scale changes.
Continuous automatic speech recognition by lipreading
TLDR
While this study focuses on the feasibility, validity, and segregated contribution of exclusively continuous optical automatic speech recognition (OASR), future highly robust recognition systems should combine optical and acoustic information with syntactic, semantic, and pragmatic aids.
Towards End-To-End Speech Recognition with Recurrent Neural Networks
This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function.