Corpus ID: 225066868

Improving Noise Robustness of an End-to-End Neural Model for Automatic Speech Recognition

@article{
  title={Improving Noise Robustness of an End-to-End Neural Model for Automatic Speech Recognition},
  author={Jagadeesh Balam and Jocelyn Huang and Vitaly Lavrukhin and Slyne Deng and Somshubra Majumdar and Boris Ginsburg},
  journal={arXiv: Audio and Speech Processing}
}
We present our experiments in training a noise-robust end-to-end automatic speech recognition (ASR) model using intensive data augmentation. We explore the efficacy of fine-tuning a pre-trained model to improve noise robustness, and we find it to be a very efficient way to train for various noisy conditions, especially when the conditions in which the model will be used are unknown. Starting with a model trained on clean data helps establish baseline performance on clean speech. We…
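The core augmentation step the abstract refers to, mixing noise into clean speech at a controlled signal-to-noise ratio, can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' actual pipeline; the function name and the NumPy implementation are hypothetical.

```python
import numpy as np

def add_noise_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Mix a noise clip into a speech signal at a target SNR in dB.

    Illustrative helper, not from the paper.
    """
    # Tile or trim the noise so it covers the whole utterance.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]

    # Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db.
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # guard against silent noise clips
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```

In an "intensive" augmentation setting, one would typically draw a fresh noise clip and a random SNR (e.g. uniformly from a range such as 0-30 dB) for every training utterance, so the model rarely sees the same noisy mixture twice.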
1 Citation


CarneliNet: Neural Mixture Model for Automatic Speech Recognition
CarneliNet is a CTC-based neural network built from three mega-blocks, each composed of multiple parallel shallow sub-networks based on 1D depthwise-separable convolutions. The authors demonstrate that the number of parallel sub-networks can be dynamically reconfigured to accommodate computational requirements without retraining.


Toward Domain-Invariant Speech Recognition via Large Scale Training
This work explores the idea of building a single domain-invariant model for varied use-cases by combining large-scale training data from multiple application domains, and shows that with as little as 10 hours of data from a new domain, an adapted domain-invariant model can match the performance of a domain-specific model trained from scratch on 70 times as much data.
A study on data augmentation of reverberant speech for robust speech recognition
It is found that the performance gap between using simulated and real RIRs can be eliminated when point-source noises are added, and that the trained acoustic models not only perform well in the distant-talking scenario but also provide better results in the close-talking scenario.
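The reverberant-data augmentation studied above boils down to convolving clean speech with a room impulse response (RIR) and optionally adding a point-source noise at a chosen SNR. A minimal sketch of this idea (the helper name, defaults, and implementation are our own assumptions, not code from the paper):

```python
import numpy as np

def reverberate(speech, rir, noise=None, snr_db=20.0):
    """Simulate far-field speech: convolve with an RIR, optionally add point-source noise.

    Illustrative sketch only.
    """
    # Convolving with the impulse response applies the room's reverberation;
    # trim to the original length so labels stay aligned.
    rev = np.convolve(speech, rir)[: len(speech)]
    if noise is not None:
        noise = np.resize(noise, len(rev))  # repeat/trim noise to cover the utterance
        scale = np.sqrt(np.mean(rev ** 2) /
                        ((np.mean(noise ** 2) + 1e-12) * 10 ** (snr_db / 10)))
        rev = rev + scale * noise
    return rev
```

With a delta impulse (`rir[0] = 1`, zeros elsewhere) this reduces to the additive-noise case, which is a handy sanity check when wiring up such a pipeline.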
Deep Speech 2: End-to-End Speech Recognition in English and Mandarin
It is shown that an end-to-end deep learning approach can be used to recognize either English or Mandarin Chinese speech, two vastly different languages, and is competitive with the transcription of human workers when benchmarked on standard datasets.
JHU ASpIRE system: Robust LVCSR with TDNNs, iVector adaptation and RNN-LMs
This paper tackles the problem of reverberant speech recognition using 5500 hours of simulated reverberant data and a time-delay neural network (TDNN) architecture, which is capable of modeling long-term interactions between speech and corrupting sources in reverberant environments.
Quartznet: Deep Automatic Speech Recognition with 1D Time-Channel Separable Convolutions
A new end-to-end neural acoustic model for automatic speech recognition that achieves near state-of-the-art accuracy on LibriSpeech and Wall Street Journal, while having fewer parameters than all competing models.
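The 1D time-channel separable convolution behind QuartzNet's parameter savings factors a full 1D convolution into a per-channel (depthwise) convolution over time followed by a pointwise (1x1) channel-mixing step, reducing the weight count from roughly out_channels x channels x k to channels x k + out_channels x channels. A minimal NumPy sketch of the operation (an illustrative reimplementation, not the QuartzNet code):

```python
import numpy as np

def time_channel_separable_conv(x, depthwise, pointwise):
    """Separable 1D conv: x is (channels, time), depthwise is (channels, k),
    pointwise is (out_channels, channels). Same-length output via zero padding.

    Illustrative sketch; real models use optimized grouped-conv kernels.
    """
    c, t = x.shape
    k = depthwise.shape[1]  # assumed odd for symmetric "same" padding
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad)))
    # Depthwise step: each channel is filtered with its own length-k kernel over time.
    dw = np.stack([np.convolve(xp[i], depthwise[i], mode="valid") for i in range(c)])
    # Pointwise step: a 1x1 convolution mixes channels independently at each time step.
    return pointwise @ dw
```

For example, with 256 input/output channels and k = 33, the separable form needs about 256*33 + 256*256 ≈ 74k weights versus 256*256*33 ≈ 2.2M for a dense 1D convolution, which is the source of the "fewer parameters" claim.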
An Overview of Noise-Robust Automatic Speech Recognition
A thorough overview of modern noise-robust techniques for ASR developed over the past 30 years is provided and methods that are proven to be successful and that are likely to sustain or expand their future applicability are emphasized.
Building and Evaluation of a Real Room Impulse Response Dataset
It is shown that a limited number of real RIRs, carefully selected to match the target environment, provide results comparable to a large number of artificially generated RIRs, and that both sets can be combined to achieve the best ASR results.
Common Voice: A Massively-Multilingual Speech Corpus
This work presents speech recognition experiments using Mozilla’s DeepSpeech Speech-to-Text toolkit and finds an average Character Error Rate improvement for twelve target languages; for most of these languages, these are the first published results on end-to-end Automatic Speech Recognition.
Librispeech: An ASR corpus based on public domain audio books
It is shown that acoustic models trained on LibriSpeech give a lower error rate on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.
MLS: A Large-Scale Multilingual Dataset for Speech Research
This paper introduces the Multilingual LibriSpeech (MLS) dataset, a large multilingual corpus suitable for speech research, and argues that such a large transcribed dataset will open new avenues in ASR and Text-To-Speech research.