• Corpus ID: 37048542

Lipreading using convolutional neural network

@inproceedings{Noda2014LipreadingUC,
  title={Lipreading using convolutional neural network},
  author={Kuniaki Noda and Yuki Yamaguchi and Kazuhiro Nakadai and Hiroshi G. Okuno and Tetsuya Ogata},
  booktitle={INTERSPEECH},
  year={2014}
}
In recent automatic speech recognition studies, deep learning architecture applications for acoustic modeling have eclipsed conventional sound features such as Mel-frequency cepstral coefficients. […] By training a CNN with images of a speaker's mouth area in combination with phoneme labels, the CNN acquires multiple convolutional filters, which are used to extract visual features essential for recognizing phonemes.
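A minimal sketch of the key method described above: a small CNN maps a single mouth-region image to phoneme class scores, so the learned convolutional filters act as visual feature extractors. The layer sizes, the 64x64 grayscale input, and the 40-phoneme inventory are illustrative assumptions, not the paper's exact configuration (Python/PyTorch):

import torch
import torch.nn as nn

class MouthPhonemeCNN(nn.Module):
    def __init__(self, num_phonemes=40):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=5), nn.ReLU(), nn.MaxPool2d(2),
        )
        # for 64x64 grayscale input the feature map is 32 x 13 x 13
        self.classifier = nn.Linear(32 * 13 * 13, num_phonemes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

model = MouthPhonemeCNN()
frames = torch.randn(8, 1, 64, 64)    # batch of mouth-region crops (assumed size)
labels = torch.randint(0, 40, (8,))   # frame-level phoneme labels
loss = nn.CrossEntropyLoss()(model(frames), labels)
loss.backward()  # training drives the filters toward phoneme-discriminative visual features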

Citations

Lipreading Architecture Based on Multiple Convolutional Neural Networks for Sentence-Level Visual Speech Recognition
TLDR
A novel lipreading architecture combines three different convolutional neural networks (CNNs) followed by a two-layer bi-directional gated recurrent unit, and shows improved performance even when visual ambiguity arises, thereby increasing VSR reliability for practical applications.
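A rough sketch of this architecture family, assuming per-frame CNN feature vectors are already computed: a two-layer bi-directional GRU consumes the frame features and a linear head produces per-frame class scores. All dimensions below are illustrative assumptions:

import torch
import torch.nn as nn

cnn_dim, hidden, num_classes = 256, 128, 40
rnn = nn.GRU(input_size=cnn_dim, hidden_size=hidden,
             num_layers=2, bidirectional=True, batch_first=True)
head = nn.Linear(2 * hidden, num_classes)  # 2x for the two directions

feats = torch.randn(4, 75, cnn_dim)   # (batch, frames, per-frame CNN features)
out, _ = rnn(feats)                   # (batch, frames, 2*hidden)
logits = head(out)                    # per-frame class scores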
Visual Recognition of Continuous Cued Speech Using a Tandem CNN-HMM Approach
TLDR
In its best configuration, and without exploiting any dictionary or language model, the proposed tandem CNN-HMM architecture correctly identifies more than 73% of the phonemes (62% when insertion errors are considered).
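A sketch of the tandem idea under stated assumptions: CNN frame posteriors are divided by phoneme priors to obtain scaled likelihoods and then decoded with an HMM Viterbi pass. The transition matrix, priors, and posteriors below are flat or random placeholders, not values from the paper:

import numpy as np

def viterbi(log_lik, log_trans, log_init):
    """log_lik: (T, S) scaled log-likelihoods; returns the best state path."""
    T, S = log_lik.shape
    delta = log_init + log_lik[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_trans   # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_lik[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

T, S = 50, 40
posteriors = np.random.dirichlet(np.ones(S), size=T)  # stand-in CNN outputs
priors = np.full(S, 1.0 / S)
log_lik = np.log(posteriors) - np.log(priors)         # posterior / prior = scaled likelihood
log_trans = np.log(np.full((S, S), 1.0 / S))          # flat placeholder transitions
states = viterbi(log_lik, log_trans, np.log(priors))  # frame-level phoneme sequence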
Feature extraction using multimodal convolutional neural networks for visual speech recognition
  • E. Tatulli, T. Hueber
  • Computer Science
    2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2017
TLDR
This article addresses the problem of continuous speech recognition from visual information only, without exploiting any audio signal, and investigates the use of convolutional neural networks to extract visual features directly from the raw ultrasound and video images.
Visual Speech Recognition of Lips Images Using Convolutional Neural Network in VGG-M Model
TLDR
This paper presents visual speech recognition of lip images using a convolutional neural network based on the VGG-M model, achieving a validation accuracy of 87% on the seen test and 30% test accuracy on the unseen test.
Speech Recognition Using Historian Multimodal Approach
TLDR
The experimental results show that early integration of audio and visual features achieved a clear enhancement in recognition accuracy and that BiLSTM is the most effective classification technique when compared to HMM.
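An illustrative sketch of early integration, assuming frame-synchronous features: per-frame audio and visual vectors are concatenated before a bi-directional LSTM classifier. The 39-dimensional audio (MFCC-like), 50-dimensional visual, and 10-word vocabulary sizes are assumptions:

import torch
import torch.nn as nn

audio_dim, visual_dim, hidden, num_words = 39, 50, 64, 10
lstm = nn.LSTM(audio_dim + visual_dim, hidden,
               bidirectional=True, batch_first=True)
head = nn.Linear(2 * hidden, num_words)

audio = torch.randn(2, 100, audio_dim)      # e.g., MFCC frames
visual = torch.randn(2, 100, visual_dim)    # e.g., lip-region features
fused = torch.cat([audio, visual], dim=-1)  # early integration: fuse before modeling
out, _ = lstm(fused)
logits = head(out[:, -1])                   # utterance-level decision from the last frame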
Spatiotemporal Convolutional Features for Lipreading
TLDR
A visual parametrization method is proposed for lipreading and audiovisual speech recognition from frontal face videos, using learned spatiotemporal convolutions in a deep neural network trained to predict phonemes at the frame level.
MobiLipNet: Resource-Efficient Deep Learning Based Lipreading
TLDR
This paper investigates the MobileNet convolutional neural network architectures, recently proposed for image classification, and extends the 2D convolutions of MobileNets to 3D ones, in order to better model the spatio-temporal nature of the lipreading problem.
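A sketch of that 2D-to-3D extension: a MobileNet-style depthwise-separable block rebuilt with Conv3d so the kernels also span time. Channel counts, kernel size, and the clip shape are illustrative assumptions:

import torch
import torch.nn as nn

def separable_conv3d(cin, cout, k=3):
    return nn.Sequential(
        # depthwise: one 3D filter per input channel (groups=cin)
        nn.Conv3d(cin, cin, kernel_size=k, padding=k // 2, groups=cin),
        nn.BatchNorm3d(cin), nn.ReLU(),
        # pointwise: 1x1x1 convolution mixes channels cheaply
        nn.Conv3d(cin, cout, kernel_size=1),
        nn.BatchNorm3d(cout), nn.ReLU(),
    )

block = separable_conv3d(16, 32)
clip = torch.randn(1, 16, 25, 44, 44)  # (batch, channels, time, H, W)
print(block(clip).shape)               # torch.Size([1, 32, 25, 44, 44])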
LIP-READING VIA DEEP NEURAL NETWORKS USING HYBRID VISUAL FEATURES
TLDR
The resulting accuracies demonstrate that the proposed lipreading model outperforms the conventional Hidden Markov Model (HMM) and competes well with state-of-the-art visual speech recognition works.
Deep Audio-visual System for Closed-set Word-level Speech Recognition
TLDR
Experiments on the LRW-1000 dataset demonstrate that the proposed joint training scheme with audio-visual incorporation is capable of enhancing the recognition performance of relatively short-duration samples, revealing multi-modal complementarity.
...

References

SHOWING 1-10 OF 22 REFERENCES
Estimating phoneme class conditional probabilities from raw speech signal using convolutional neural networks
TLDR
This paper investigates a novel approach in which the input to the ANN is the raw speech signal and the output is phoneme class conditional probability estimates, indicating that CNNs can learn features relevant for phoneme classification automatically from the raw speech signal.
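A minimal sketch of that setup: a 1D CNN consumes the raw waveform and emits per-frame phoneme class conditional probabilities. The window length (roughly a 10 ms hop at 16 kHz), filter counts, and 40 phoneme classes are assumptions:

import torch
import torch.nn as nn

num_phonemes = 40
net = nn.Sequential(
    nn.Conv1d(1, 32, kernel_size=160, stride=80), nn.ReLU(),  # ~10 ms hop at 16 kHz
    nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
    nn.Conv1d(64, num_phonemes, kernel_size=1),               # per-frame logits
)

wave = torch.randn(1, 1, 16000)        # one second of 16 kHz audio
posteriors = net(wave).softmax(dim=1)  # (1, 40, frames) class conditional probabilities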
Extraction of Visual Features for Lipreading
TLDR
Three methods for parameterizing lip image sequences for recognition using hidden Markov models are compared; two are top-down approaches that fit a model of the inner and outer lip contours and derive lipreading features from a principal component analysis of shape, or of shape and appearance, respectively.
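A sketch of the top-down parameterization, assuming lip-contour landmarks have already been tracked: per-frame (x, y) landmark coordinates are reduced by PCA to a compact lipreading feature vector. The landmark and component counts are assumptions, and the data here are random stand-ins:

import numpy as np
from sklearn.decomposition import PCA

n_frames, n_landmarks = 500, 20
# stand-in data: (x, y) coordinates of inner/outer lip contour points per frame
shapes = np.random.randn(n_frames, 2 * n_landmarks)

pca = PCA(n_components=10)
features = pca.fit_transform(shapes)        # per-frame lipreading features
print(pca.explained_variance_ratio_.sum())  # variance retained by 10 components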
Feature analysis for automatic speechreading
  • P. Scanlon, R. Reilly
  • Computer Science
    2001 IEEE Fourth Workshop on Multimedia Signal Processing (Cat. No.01TH8564)
  • 2001
TLDR
It was observed that static features alone outperform a combination of static and dynamic features when the dimension of the feature vector is restricted, illustrating that a certain level of static detail is a higher priority for visual speech recognition than dynamic information.
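For concreteness, dynamic features are typically frame-to-frame deltas appended to the static vector, which doubles the dimension the study was constraining; a toy sketch with assumed sizes:

import numpy as np

def add_deltas(static):
    """static: (T, D) features; returns (T, 2D) static + delta."""
    delta = np.gradient(static, axis=0)  # finite-difference dynamic features
    return np.concatenate([static, delta], axis=1)

static = np.random.randn(100, 12)  # stand-in visual features
both = add_deltas(static)          # (100, 24): same budget now split across two feature types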
Rapid and effective speaker adaptation of convolutional neural network based models for speech recognition
TLDR
Experimental results on the TIMIT dataset demonstrate that both methods are quite effective for adapting CNN-based acoustic models and that combining the two methods achieves even better performance.
Deep Neural Networks for Acoustic Modeling in Speech Recognition
TLDR
This paper provides an overview of this progress and represents the shared views of four research groups who have had recent successes in using deep neural networks for acoustic modeling in speech recognition.
Improving visual features for lip-reading
TLDR
It is demonstrated that, by careful choice of technique, the effects of inter-speaker variability in the visual features can be reduced, which improves significantly the recognition accuracy of an automated lip-reading system.
A comparison of model and transform-based visual features for audio-visual LVCSR
TLDR
Four different visual speech parameterisation methods are compared on a large-vocabulary, continuous, audio-visual speech recognition task using the IBM ViaVoice™ audio-visual speech database, with an active appearance model used to track the face and obtain model parameters describing the entire face.
Automatic speech recognition improved by two-layered audio-visual integration for robot audition
TLDR
Two-layered audio-visual integration is presented to make automatic speech recognition (ASR) more robust against speaker distance, interfering talkers, and environmental noise.
Comparison of low- and high-level visual features for audio-visual continuous automatic speech recognition
  • P. Aleksic, A. Katsaggelos
  • Computer Science
    2004 IEEE International Conference on Acoustics, Speech, and Signal Processing
  • 2004
TLDR
Conclusions are drawn on the trade-off between the dimensionality of the visual features and the amount of speechreading information they contain, and on its influence on AV-ASR performance.
ImageNet classification with deep convolutional neural networks
TLDR
A large, deep convolutional neural network was trained to classify the 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes and employed a recently developed regularization method called "dropout" that proved to be very effective.
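A minimal illustration of the dropout regularization this reference popularized: units are randomly zeroed during training, and the layer is a no-op at inference. The p=0.5 rate matches the common choice; the layer sizes are assumptions:

import torch
import torch.nn as nn

layer = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Dropout(p=0.5))
x = torch.randn(4, 256)

layer.train()
y_train = layer(x)  # roughly half the activations zeroed, survivors scaled by 2
layer.eval()
y_eval = layer(x)   # dropout disabled at inference time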
...