Investigations on End-to-End Audiovisual Fusion

@inproceedings{wand2018investigations,
  title={Investigations on End-to-End Audiovisual Fusion},
  author={Michael Wand and Ngoc Thang Vu and Juergen Schmidhuber},
  booktitle={2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2018}
}
Audiovisual speech recognition (AVSR) is a method to alleviate the adverse effect of noise in the acoustic signal. Leveraging recent developments in deep neural network-based speech recognition, we present an AVSR neural network architecture which is trained end-to-end, without the need to separately model the process of decision fusion as in conventional (e.g. HMM-based) systems. The fusion system outperforms single-modality recognition under all noise conditions. Investigation of the saliency… 
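The end-to-end fusion idea in the abstract can be sketched as a tiny NumPy forward pass: each modality gets its own encoder, the encoded streams are concatenated, and a joint classifier produces word posteriors. All dimensions, weights, and layer shapes below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder(x, w, b):
    """One dense ReLU layer, standing in for a modality-specific network stack."""
    return np.maximum(0.0, x @ w + b)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: 40-dim acoustic features, 64-dim visual features, 10 words.
audio_dim, video_dim, hidden, n_words = 40, 64, 32, 10
Wa, ba = 0.1 * rng.normal(size=(audio_dim, hidden)), np.zeros(hidden)
Wv, bv = 0.1 * rng.normal(size=(video_dim, hidden)), np.zeros(hidden)
Wf, bf = 0.1 * rng.normal(size=(2 * hidden, n_words)), np.zeros(n_words)

audio = rng.normal(size=(1, audio_dim))   # one frame of acoustic features
video = rng.normal(size=(1, video_dim))   # one frame of visual features

# Feature-level fusion: concatenate the encoded streams and classify jointly,
# so the whole pipeline can be trained end-to-end with one loss.
fused = np.concatenate([encoder(audio, Wa, ba), encoder(video, Wv, bv)], axis=-1)
probs = softmax(fused @ Wf + bf)
```

Because fusion happens inside the network, no separate decision-fusion stage (as in HMM-based systems) is needed; gradients from the word-level loss reach both encoders.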


Fusion Architectures for Word-Based Audiovisual Speech Recognition
This study investigates architectures for modality fusion in audiovisual speech recognition, using video images of the speaker’s face as an additional modality, and substantially improves recognition accuracy by taking single-modality losses into account.
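The single-modality-loss idea in this summary can be sketched as an objective that adds per-stream auxiliary terms to the fused loss, so each encoder is also trained to recognize the word on its own. The posteriors and the auxiliary weight below are toy values chosen for illustration.

```python
import numpy as np

def cross_entropy(probs, target):
    """Negative log-probability of the target class."""
    return -float(np.log(probs[target]))

# Toy per-stream and fused word posteriors for a 3-word task (hypothetical).
p_audio = np.array([0.6, 0.3, 0.1])
p_video = np.array([0.5, 0.4, 0.1])
p_fused = np.array([0.8, 0.1, 0.1])
target = 0

# Total objective: fused loss plus weighted single-modality auxiliary losses.
aux_weight = 0.5   # hypothetical weighting of the auxiliary terms
total = cross_entropy(p_fused, target) + aux_weight * (
    cross_entropy(p_audio, target) + cross_entropy(p_video, target))
```

The auxiliary terms prevent the network from relying on one dominant stream and ignoring the other during training.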
Cogans For Unsupervised Visual Speech Adaptation To New Speakers
This work is the first to explore the visual adaptation of an SI-AVSR system to an unknown and unlabelled speaker and uses Coupled Generative Adversarial Networks to automatically learn a joint distribution of multi-domain images.
A Survey of Lipreading Methods Based on Deep Learning
The convolutional neural network structures used in the front-end and the sequence-processing models used in the back-end are discussed and analyzed, current lipreading datasets are introduced, and the methods applied to these datasets are compared.
Motion Dynamics Improve Speaker-Independent Lipreading
We present a novel lipreading system that improves on the task of speaker-independent word recognition by decoupling motion and content dynamics, achieved with a deep-learning architecture.
A Survey of Research on Lipreading Technology
Typical deep learning methods on lipreading are analyzed in detail according to their structural characteristics, and existing lipreading databases are listed, including their detailed information and the methods applied to these databases.
Visual Speech Recognition with Lightweight Psychologically Motivated Gabor Features
A new, lightweight feature extraction approach, motivated by human-centric, glimpse-based psychological research into facial barcodes, is presented, and it is demonstrated that these simple, easy-to-extract 3D geometric features, produced using Gabor-based image patches, can successfully be used for speech recognition with LSTM-based machine learning.
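The Gabor-patch features mentioned in this summary can be sketched in a few lines of NumPy: a Gabor kernel is a Gaussian envelope multiplied by a sinusoidal carrier, and a per-orientation filter response over a mouth-region patch yields a compact feature vector. The kernel size, wavelength, and orientation grid below are illustrative assumptions, not the paper's actual parameters.

```python
import numpy as np

def gabor_kernel(size, wavelength, theta, sigma):
    """Real part of a 2D Gabor filter: Gaussian envelope times a cosine carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # rotate coordinates by theta
    yr = -x * np.sin(theta) + y * np.cos(theta)
    return np.exp(-(xr**2 + yr**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / wavelength)

# One feature per orientation: filter response summed over the patch.
rng = np.random.default_rng(1)
patch = rng.random((9, 9))                        # stand-in for a grayscale lip patch
feats = [float((patch * gabor_kernel(9, 4.0, th, 2.0)).sum())
         for th in np.linspace(0, np.pi, 4, endpoint=False)]
```

Such features are cheap to compute compared to a CNN front-end, which is the "lightweight" appeal the summary refers to.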
Improving the Recognition Performance of Lip Reading Using the Concatenated Three Sequence Keyframe Image Technique
The C3-SKI technique was applied to lip-reading recognition, achieving accuracy, validation accuracy, loss, and validation loss values of 95.06%, 86.03%, 4.61%, and 9.04%, respectively, on the THDigits dataset.
Investigations on audiovisual emotion recognition in noisy conditions
The results show a significant performance decrease when a model trained on clean audio is applied to noisy data and that the addition of visual features alleviates this effect.


Improving audio-visual speech recognition using deep neural networks with dynamic stream reliability estimates
It is shown that this state-based integration scheme is superior to early integration of multi-modal features, even if early integration also includes the proposed reliability estimate, and is able to outperform a fixed weighting approach that exploits oracle knowledge of the true signal-to-noise ratio.
Audio-visual deep learning for noise robust speech recognition
  • Jing Huang, Brian Kingsbury
  • Computer Science
    2013 IEEE International Conference on Acoustics, Speech and Signal Processing
  • 2013
This work uses DBNs for audio-visual speech recognition; in particular, it uses deep learning from audio and visual features for noise-robust speech recognition and tests two methods for using DBNs in a multimodal setting.
"Eigenlips" for robust speech recognition
  • C. Bregler, Y. Konig
  • Physics, Computer Science
    Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing
  • 1994
This study improves the performance of a hybrid connectionist speech recognition system by incorporating visual information about the corresponding lip movements by using a new visual front end, and an alternative architecture for combining the visual and acoustic information.
Integration of deep bottleneck features for audio-visual speech recognition
This paper proposes a method of integrating DBNFs using multi-stream HMMs in order to improve the performance of AVSR under both clean and noisy conditions, and evaluates the method on a continuously spoken Japanese digit recognition task under matched and mismatched conditions.
Improving Speaker-Independent Lipreading with Domain-Adversarial Training
A lipreading system, i.e. a speech recognition system using only visual features, is presented that uses domain-adversarial training for speaker independence, yielding an end-to-end trainable system which requires only a very small number of frames of untranscribed target data to substantially improve recognition accuracy on the target speaker.
Neural network lipreading system for improved speech recognition
  • D. Stork, G. Wolff, E. Levine
  • Physics, Computer Science
    [Proceedings 1992] IJCNN International Joint Conference on Neural Networks
  • 1992
A modified time-delay neural network (TDNN) has been designed to perform automatic lipreading (speechreading) in conjunction with acoustic speech recognition in order to improve recognition accuracy.
Lipreading using convolutional neural network
The evaluation results of the isolated word recognition experiment demonstrate that the visual features acquired by the CNN significantly outperform those acquired by conventional dimensionality compression approaches, including principal component analysis.
Dynamic Stream Weighting for Turbo-Decoding-Based Audiovisual ASR
This work introduces a strategy for estimating optimal weights for the audio and video streams in turbo-decoding-based ASR using a discriminative cost function, and shows that turbo decoding with this maximally discriminative dynamic weighting of information yields higher recognition accuracy than turbo-decoding-based recognition with fixed stream weights or optimally dynamically weighted audiovisual decoding using coupled hidden Markov models.
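The dynamic stream weighting described above can be sketched as a weighted combination of per-state log-likelihoods, with the audio weight driven by a reliability estimate. The linear SNR-to-weight ramp and the toy posteriors below are illustrative assumptions, not the paper's discriminatively trained estimator.

```python
import numpy as np

def fuse_stream_scores(log_pa, log_pv, lam):
    """Weighted combination of per-state log-likelihoods from the two streams."""
    return lam * log_pa + (1.0 - lam) * log_pv

def weight_from_snr(snr_db, lo=-5.0, hi=20.0):
    """Map an SNR estimate to an audio stream weight in [0, 1] (linear ramp; hypothetical)."""
    return float(np.clip((snr_db - lo) / (hi - lo), 0.0, 1.0))

log_pa = np.log(np.array([0.7, 0.2, 0.1]))   # audio state posteriors (toy)
log_pv = np.log(np.array([0.3, 0.4, 0.3]))   # video state posteriors (toy)

clean = fuse_stream_scores(log_pa, log_pv, weight_from_snr(20.0))   # trust audio
noisy = fuse_stream_scores(log_pa, log_pv, weight_from_snr(-5.0))   # trust video
```

At high SNR the fused score follows the audio stream's best state; under heavy noise it falls back on the video stream, which is the behavior fixed stream weights cannot provide.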
Recent advances in the automatic recognition of audiovisual speech
The main components of audiovisual automatic speech recognition (ASR) are reviewed and novel contributions in two main areas are presented: first, the visual front-end design, based on a cascade of linear image transforms of an appropriate video region of interest, and subsequently, audiovisual speech integration.
Lipreading with long short-term memory
Lipreading, i.e. speech recognition from visual-only recordings of a speaker's face, can be achieved with a processing pipeline based solely on neural networks, yielding significantly better accuracy.