SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition

Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin Dogus Cubuk, Quoc V. Le
We present SpecAugment, a simple data augmentation method for speech recognition. [...] Key result: on Switchboard, we achieve 7.2%/14.6% WER on the Switchboard/CallHome portion of the Hub5'00 test set without the use of a language model, and 6.8%/14.1% with shallow fusion, compared with the previous state-of-the-art hybrid system at 8.3%/17.3% WER.


SpecSwap: A Simple Data Augmentation Method for End-to-End Speech Recognition
Presents SpecSwap, a simple data augmentation scheme for automatic speech recognition that acts directly on the spectrogram of input utterances and can be applied to Transformer-based networks for end-to-end speech recognition.
MixSpeech: Data Augmentation for Low-Resource Automatic Speech Recognition
Experimental results show that MixSpeech achieves better accuracy than the baseline models without data augmentation, and outperforms a strong data augmentation method, SpecAugment, on these recognition tasks.
SpeechStew: Simply Mix All Available Speech Recognition Data to Train One Large Neural Network
SpeechStew is a speech recognition model that is trained on a combination of various publicly available speech recognition datasets: AMI, Broadcast News, Common Voice, LibriSpeech, Switchboard/Fisher, Tedlium, and Wall Street Journal, and it is demonstrated that SpeechStew learns powerful transfer learning representations.
Frame-Level Specaugment for Deep Convolutional Neural Networks in Hybrid ASR Systems
It is demonstrated that f-SpecAugment is more effective than utterance-level SpecAugment for deep CNN-based hybrid models and has benefits approximately equivalent to doubling the amount of training data for deep CNNs.
Comparing the Benefit of Synthetic Training Data for Various Automatic Speech Recognition Architectures
This work investigates the effect of varying pre-processing, the speaker embedding and input encoding of the TTS system w.r.t. the effectiveness of the synthesized data for AED-ASR training, and compares the results to a state-of-the-art hybrid ASR system, a monophone based system using connectionist-temporal-classification (CTC) and a monotonic transducer based system.
A Comparison of Streaming Models and Data Augmentation Methods for Robust Speech Recognition
A comparative study on the robustness of two different online streaming speech recognition models: Monotonic Chunkwise Attention (MoChA) and Recurrent Neural Network-Transducer (RNN-T).
Generating Synthetic Audio Data for Attention-Based Speech Recognition Systems
This work extends state-of-the-art attention-based automatic speech recognition (ASR) systems with synthetic audio generated by a TTS system trained only on the ASR corpora itself, closing the gap to a comparable oracle experiment by more than 50%.
SynthASR: Unlocking Synthetic Data for Speech Recognition
The observations show that SynthASR holds great promise in training state-of-the-art large-scale E2E ASR models for new applications while reducing the costs and dependency on production data.
A Neural Acoustic Echo Canceller Optimized Using An Automatic Speech Recognizer and Large Scale Synthetic Data
This work augments the loss function with a term that produces outputs useful to a pre-trained ASR model and shows that this augmented loss function improves WER metrics, and demonstrates that augmenting the training dataset of real world examples with a large synthetic dataset improves performance.
SpecAugment on Large Scale Datasets
  • Daniel S. Park, Yu Zhang, Yonghui Wu
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
This paper demonstrates SpecAugment's effectiveness on tasks with large scale datasets by investigating its application to the Google Multidomain Dataset, and introduces a modification of SpecAugment that adapts the time mask size and/or multiplicity depending on the length of the utterance, which can potentially benefit large scale tasks.
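
The length-adaptive time masking mentioned in the last entry above can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the function name, the ratio parameters (`size_ratio`, `mult_ratio`), the cap `max_masks`, their default values, and the list-of-frames spectrogram layout are all assumptions introduced here.

```python
import random


def adaptive_time_mask(spec, size_ratio=0.04, mult_ratio=0.04, max_masks=20, rng=None):
    """Zero out random time spans whose size and count scale with utterance length.

    spec: list of frames, each frame a list of floats (time-major spectrogram).
    All parameter names and default values are illustrative assumptions.
    """
    rng = rng or random.Random()
    num_frames = len(spec)
    out = [list(frame) for frame in spec]  # copy so the input stays untouched

    # Longer utterances get a larger maximum mask width ...
    max_size = int(size_ratio * num_frames)
    # ... and more masks, up to a fixed cap.
    num_masks = min(max_masks, int(mult_ratio * num_frames))

    for _ in range(num_masks):
        width = rng.randint(0, max_size)            # inclusive on both ends
        start = rng.randint(0, num_frames - width)  # keep the span in range
        for t in range(start, start + width):
            out[t] = [0.0] * len(out[t])
    return out
```

With these assumed defaults, a 100-frame utterance receives up to 4 masks of at most 4 frames each, while a 10-frame utterance receives none; a fixed-size policy would instead apply the same masks regardless of length.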


Deep Speech: Scaling up end-to-end speech recognition
Deep Speech, a state-of-the-art speech recognition system developed using end-to-end deep learning, outperforms previously published results on the widely studied Switchboard Hub5'00, achieving 16.0% error on the full test set.
Improved training of end-to-end attention models for speech recognition
This work introduces a new pretraining scheme by starting with a high time reduction factor and lowering it during training, which is crucial both for convergence and final performance, and trains long short-term memory (LSTM) language models on subword units.
Listen, attend and spell: A neural network for large vocabulary conversational speech recognition
We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional
State-of-the-Art Speech Recognition with Sequence-to-Sequence Models
  • C. Chiu, T. Sainath, M. Bacchiani
  • Computer Science
    2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2018
A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced, which offers improvements over the commonly used single-head attention.
Building Competitive Direct Acoustics-to-Word Models for English Conversational Speech Recognition
A joint word-character A2W model that learns to first spell the word and then recognize it and provides a rich output to the user instead of simple word hypotheses, making it especially useful in the case of words unseen or rarely-seen during training.
Vocal Tract Length Perturbation (VTLP) improves speech recognition
Improvements in speech recognition are suggested without increasing the number of training epochs, and it is suggested that data transformations should be an important component of training neural networks for speech, especially for data limited projects.
The CAPIO 2017 Conversational Speech Recognition System
This paper shows how the state-of-the-art performance on the industry-standard NIST 2000 Hub5 English evaluation set is achieved, and proposes an acoustic model adaptation scheme that simply averages the parameters of a seed neural network acoustic model and its adapted version.
A Perceptually Inspired Data Augmentation Method for Noise Robust CNN Acoustic Models
A data augmentation method that improves the robustness of convolutional neural network-based speech recognizers to additive noise by introducing two simple heuristics that select the less reliable components of the spectrum of the speech signal as candidates for dropout.
Letter-Based Speech Recognition with Gated ConvNets
A new speech recognition system, leveraging a simple letter-based ConvNet acoustic model, which shows near state-of-the-art results in word error rate on the LibriSpeech corpus using log-mel filterbanks, both on the "clean" and "other" configurations.
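
The adaptation scheme described in the CAPIO 2017 entry above (averaging the parameters of a seed acoustic model and its adapted version) can be illustrated with a short sketch. The function name and the dict-of-lists parameter layout are assumptions made for illustration; the entry does not specify how the parameters are stored or whether the average is weighted, so an unweighted mean is assumed.

```python
def average_parameters(seed_params, adapted_params):
    """Element-wise mean of two models' parameters.

    Both arguments map a parameter name to a flat list of floats; this
    dict-of-lists layout is an assumption made for illustration.
    """
    if seed_params.keys() != adapted_params.keys():
        raise ValueError("models must have the same parameter names")

    averaged = {}
    for name, seed_values in seed_params.items():
        adapted_values = adapted_params[name]
        if len(seed_values) != len(adapted_values):
            raise ValueError(f"shape mismatch for {name!r}")
        # Unweighted mean of the seed and adapted values (assumed weighting).
        averaged[name] = [(s + a) / 2.0 for s, a in zip(seed_values, adapted_values)]
    return averaged
```

For example, averaging `{"w": [1.0, 3.0]}` with `{"w": [3.0, 5.0]}` yields `{"w": [2.0, 4.0]}`.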