Fusion Architectures for Word-Based Audiovisual Speech Recognition

Michael Wand, Jürgen Schmidhuber
In this study we investigate architectures for modality fusion in audiovisual speech recognition, where one aims to alleviate the adverse effect of acoustic noise on the speech recognition accuracy by using video images of the speaker’s face as an additional modality. Starting from an established neural network fusion system, we substantially improve the recognition accuracy by taking single-modality losses into account: late fusion (at the output logits level) is substantially more robust than… 
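
As an illustration of what late fusion at the output logits level combined with single-modality losses can look like, here is a minimal NumPy sketch. It is not the authors' implementation: summing the two logit vectors, the `aux_weight` parameter, and all function names are assumptions made for illustration, since the abstract does not specify these details.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def cross_entropy(logits, labels):
    """Mean cross-entropy of integer class labels given unnormalized logits."""
    log_probs = log_softmax(logits)
    return -log_probs[np.arange(len(labels)), labels].mean()


def late_fusion_loss(audio_logits, video_logits, labels, aux_weight=1.0):
    """Fuse two modality-specific classifiers at the logits level.

    Assumption: fusion is a plain sum of the per-modality word logits, and
    the single-modality losses are added with a shared weight `aux_weight`.
    Returns the total loss and the fused logits.
    """
    fused_logits = audio_logits + video_logits
    fused_loss = cross_entropy(fused_logits, labels)
    audio_loss = cross_entropy(audio_logits, labels)
    video_loss = cross_entropy(video_logits, labels)
    total = fused_loss + aux_weight * (audio_loss + video_loss)
    return total, fused_logits


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_utterances, num_words = 8, 10            # hypothetical batch and vocabulary sizes
    audio = rng.normal(size=(num_utterances, num_words))
    video = rng.normal(size=(num_utterances, num_words))
    labels = rng.integers(0, num_words, size=num_utterances)
    loss, fused = late_fusion_loss(audio, video, labels)
    print("total loss:", loss)
    print("predicted words:", fused.argmax(axis=-1))
```

In this sketch the auxiliary terms keep each single-modality branch trained as a classifier in its own right, so the fused prediction does not depend entirely on whichever modality is easier to fit.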


Towards a practical lip-to-speech conversion system using deep neural networks and mobile application frontend
A lip-to-speech synthesis system consisting of a backend for deep neural network training and inference and a frontend in the form of a mobile application, intended to help speech-impaired users communicate.


Attention-based Audio-Visual Fusion for Robust Automatic Speech Recognition
This paper proposes an audio-visual fusion strategy that goes beyond simple feature concatenation and learns to automatically align the two modalities, leading to enhanced representations which increase the recognition accuracy in both clean and noisy conditions.
Investigations on End-to-End Audiovisual Fusion
Investigation of the saliency of the input features shows that the neural network automatically adapts to different noise levels in the acoustic signal, and the fusion system outperforms single-modality recognition under all noise conditions.
Comparing Fusion Models for DNN-Based Audiovisual Continuous Speech Recognition
  • A. H. Abdelaziz
  • Computer Science
    IEEE/ACM Transactions on Audio, Speech, and Language Processing
  • 2018
NTCD-TIMIT with its freely available visual features and 37 clean and noisy acoustic signals allows for this study to be a common benchmark, to which novel LVCSR AV-ASR models and approaches can be compared.
Lipreading using convolutional neural network
The evaluation results of the isolated word recognition experiment demonstrate that the visual features acquired by the CNN significantly outperform those acquired by conventional dimensionality compression approaches, including principal component analysis.
Deep complementary bottleneck features for visual speech recognition
  • S. Petridis, M. Pantic
  • Computer Science
    2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2016
This is the first work that extracts DBNFs for visual speech recognition directly from pixels based on deep autoencoders and the extracted complementary DBNF in combination with DCT features achieve the best performance.
Improving Speaker-Independent Lipreading with Domain-Adversarial Training
A lipreading system, i.e. a speech recognition system using only visual features, that uses domain-adversarial training for speaker independence, yielding an end-to-end trainable system which requires only a very small number of frames of untranscribed target data to substantially improve the recognition accuracy on the target speaker.
Motion Dynamics Improve Speaker-Independent Lipreading
We present a novel lipreading system that improves on the task of speaker-independent word recognition by decoupling motion and content dynamics. We achieve this by implementing a deep learning
A Large-Scale Open-Source Acoustic Simulator for Speaker Recognition
While error rates increase considerably under degraded speech conditions, large relative equal error rate (EER) reductions were observed when using a PLDA model trained with a large number of degraded sessions per speaker.
Lipreading with long short-term memory
Lipreading, i.e. speech recognition from visual-only recordings of a speaker's face, can be achieved with a processing pipeline based solely on neural networks, yielding significantly better accuracy
An audio-visual corpus for speech perception and automatic speech recognition.
An audio-visual corpus that consists of high-quality audio and video recordings of 1000 sentences spoken by each of 34 talkers to support the use of common material in speech perception and automatic speech recognition studies.