An Investigation of End-to-End Models for Robust Speech Recognition

  • Archiki Prasad, Preethi Jyothi, Rajbabu Velmurugan
  • Published 11 February 2021
  • Computer Science
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
End-to-end models for robust automatic speech recognition (ASR) have not been sufficiently explored in prior work. With end-to-end models, one can preprocess the input speech using speech enhancement techniques and train the model on the enhanced speech. Alternatively, one can pass the noisy speech as input and modify the model architecture to adapt to it. A systematic comparison of these two approaches for end-to-end robust ASR has not been attempted before. We… 
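The two alternatives described in the abstract can be sketched as follows; `enhance`, `asr_model`, and the signal shapes are hypothetical stand-ins for illustration, not the paper's actual components.

```python
import numpy as np

def enhance(noisy):
    """Hypothetical speech-enhancement front-end (placeholder transform)."""
    return noisy * 0.9  # stand-in for spectral masking / filtering

def asr_model(features):
    """Hypothetical end-to-end ASR model returning a dummy transcript."""
    return "transcript"

noisy_speech = np.random.randn(16000)  # 1 s of audio at 16 kHz

# Approach 1: preprocess with enhancement, feed enhanced speech to the model.
hyp1 = asr_model(enhance(noisy_speech))

# Approach 2: feed noisy speech directly; robustness must come from the
# (adapted) model architecture rather than a separate front-end.
hyp2 = asr_model(noisy_speech)
```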

Recent Advances in End-to-End Automatic Speech Recognition
This paper overviews recent advances in E2E models, focusing on technologies that address the associated challenges from an industry perspective.
A Noise-Robust Self-supervised Pre-training Model Based Speech Representation Learning for Automatic Speech Recognition
Experimental results reveal that the proposed enhanced wav2vec2.0 model not only improves ASR performance on the noisy test set, surpassing the original model, but also incurs only a tiny performance decrease on the clean test set.
Multiple Confidence Gates For Joint Training Of SE And ASR
Joint training of a speech enhancement (SE) model and a speech recognition (ASR) model is a common solution for robust ASR in noisy environments; SE focuses on improving the auditory quality of speech.
Jointly Adversarial Enhancement Training for Robust End-to-End Speech Recognition
This paper proposes jointly adversarial enhancement training to boost the robustness of end-to-end systems, achieving a relative error rate reduction of 4.6% over multi-condition training.
Learning Noise Invariant Features Through Transfer Learning For Robust End-to-End Speech Recognition
This work argues that the clean classifier can force the feature extractor to learn the underlying noise invariant patterns in the noisy dataset, and proposes transfer learning from a clean dataset (WSJ) to a noisy dataset (CHiME4) for connectionist temporal classification models.
SEGAN: Speech Enhancement Generative Adversarial Network
This work proposes the use of generative adversarial networks for speech enhancement, operating at the waveform level and training the model end-to-end; it incorporates 28 speakers and 40 different noise conditions into the same model, so that model parameters are shared across them.
Deep Xi as a Front-End for Robust Automatic Speech Recognition
  • Aaron Nicolson, K. Paliwal
  • Computer Science
  • 2020 IEEE Asia-Pacific Conference on Computer Science and Data Engineering (CSDE)
  • 2020
The experimental investigation of Deep Xi as a front-end for robust ASR shows that it is a viable front-end and can significantly increase the robustness of an ASR system.
Unsupervised domain adaptation for robust speech recognition via variational autoencoder-based data augmentation
This paper addresses the unsupervised domain adaptation problem for robust speech recognition, where both source and target domain speech are available, but word transcripts are only available for the source domain speech.
How Accents Confound: Probing for Accent Information in End-to-End Speech Recognition Systems
This work uses a state-of-the-art end-to-end ASR system that is trained on a large amount of US-accented English speech, and examines the effects of accent on the internal representation using three main probing techniques.
Deep Speech 2 : End-to-End Speech Recognition in English and Mandarin
It is shown that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages, and is competitive with the transcription of human workers when benchmarked on standard datasets.
An investigation of deep neural networks for noise robust speech recognition
The noise robustness of DNN-based acoustic models can match state-of-the-art performance on the Aurora 4 task without any explicit noise compensation and can be further improved by incorporating information about the environment into DNN training using a new method called noise-aware training.
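Noise-aware training augments each acoustic feature frame with an estimate of the noise, e.g., the average of a few leading frames assumed to be speech-free. A minimal numpy sketch, with frame counts and feature dimensions chosen arbitrarily (the paper's exact recipe may differ):

```python
import numpy as np

rng = np.random.default_rng(0)
frames = rng.standard_normal((100, 40))  # 100 frames of 40-dim log-mel features

# Simple noise estimate: mean of the first 10 frames, assumed speech-free.
noise_estimate = frames[:10].mean(axis=0)

# Noise-aware input: each frame concatenated with the fixed noise estimate,
# doubling the feature dimension from 40 to 80.
nat_input = np.hstack([frames, np.tile(noise_estimate, (len(frames), 1))])
print(nat_input.shape)  # (100, 80)
```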
Deep Learning for Environmentally Robust Speech Recognition: An Overview of Recent Developments
This overview reviews recently developed, representative deep learning approaches for tackling non-stationary additive and convolutional degradation of speech, with the aim of providing guidelines for those developing environmentally robust speech recognition systems.
Joint noise adaptive training for robust automatic speech recognition
  • A. Narayanan, Deliang Wang
  • Computer Science
  • 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
By formulating separation as a supervised mask estimation problem, a unified DNN framework is developed that jointly improves separation and acoustic modeling, yielding better performance on the Aurora-4 dataset.
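Supervised mask estimation trains a network to predict a time-frequency mask such as the ideal ratio mask (IRM), the fraction of mixture energy attributable to speech in each bin. Computing the IRM target from parallel clean/noise power spectrograms can be sketched as follows (shapes are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
clean_power = rng.random((100, 257))   # |S|^2 per time-frequency bin
noise_power = rng.random((100, 257))   # |N|^2 per time-frequency bin

# Ideal ratio mask: fraction of mixture energy attributed to speech.
irm = clean_power / (clean_power + noise_power)

# Applying the mask to the mixture suppresses noise-dominated bins;
# here it recovers clean_power exactly, by construction.
mixture_power = clean_power + noise_power
separated = irm * mixture_power
```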