Synt++: Utilizing Imperfect Synthetic Data to Improve Speech Recognition

Ting-yao Hu, Mohammadreza Armandpour, Ashish Shrivastava, Jen-Hao Rick Chang, Hema Swetha Koppula, Oncel Tuzel
With recent advances in speech synthesis, synthetic data is becoming a viable alternative to real data for training speech recognition models. However, machine learning with synthetic data is not trivial due to the gap between the synthetic and the real data distributions. Synthetic datasets may contain artifacts that do not exist in real data such as structured noise, content errors, or unrealistic speaking styles. Moreover, the synthesis process may introduce a bias due to uneven sampling of… 

Training Keyword Spotters with Limited and Synthesized Speech Data
This paper uses a speech embedding model pre-trained to extract useful features for keyword spotting, and shows that a model which detects 10 keywords when trained on only synthetic speech is equivalent to a model trained on over 500 real examples.
Generating Synthetic Audio Data for Attention-Based Speech Recognition Systems
This work extends state-of-the-art attention-based automatic speech recognition (ASR) systems with synthetic audio generated by a TTS system trained only on the ASR corpora itself, closing the gap to a comparable oracle experiment by more than 50%.
Improving Speech Recognition Using GAN-Based Speech Synthesis and Contrastive Unspoken Text Selection
This work proposes combining a generative adversarial network (GAN) with multi-style training (MTR) to increase acoustic diversity in the synthesized data, and presents a contrastive language model-based data selection technique to improve the efficiency of learning from unspoken text.
Speaker Augmentation for Low Resource Speech Recognition
  • Chenpeng Du, Kai Yu
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
Experiments on a Switchboard task show that, given 50 hours of data, the proposed speaker augmentation with SpecAugment significantly reduces word error rate (WER) by 30% relative compared to the system without any data augmentation, and about 18%.
Speech Recognition with Augmented Synthesized Speech
  • A. Rosenberg, Yu Zhang, Zelin Wu
  • Computer Science, Physics
    2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
  • 2019
This paper finds that improvements to speech recognition performance are achievable by augmenting training data with synthesized material; however, there remains a substantial gap in performance between recognizers trained on human speech and those trained on synthesized speech.
Leveraging Sequence-to-Sequence Speech Synthesis for Enhancing Acoustic-to-Word Speech Recognition
This paper explores how current speech synthesis technology can be leveraged to tailor an ASR system to a target domain by preparing only a relevant text corpus, and generates speech features using a sequence-to-sequence speech synthesizer.
Cycle-consistency Training for End-to-end Speech Recognition
This paper presents a method to train end-to-end automatic speech recognition (ASR) models on unpaired data using a Text-To-Encoder model, and defines a loss based on the encoder reconstruction error, which reduces the word error rate by 14.7% on the LibriSpeech corpus.
Self-supervised Sequence-to-sequence ASR using Unpaired Speech and Text
This work proposes a new semi-supervised loss combining an end-to-end differentiable ASR loss that is able to leverage both unpaired speech and text data to outperform recently proposed related techniques in terms of %WER.
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
Back-Translation-Style Data Augmentation for end-to-end ASR
Inspired by the back-translation technique proposed in the field of machine translation, a neural text-to-encoder model is built which predicts a sequence of hidden states extracted by a pre-trained E2E-ASR encoder from a sequence of characters.