Speech Recognition with Augmented Synthesized Speech

@article{Rosenberg2019SpeechRW,
  title={Speech Recognition with Augmented Synthesized Speech},
  author={Andrew Rosenberg and Yu Zhang and Bhuvana Ramabhadran and Ye Jia and Pedro J. Moreno and Yonghui Wu and Zelin Wu},
  journal={2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)},
  year={2019},
  pages={996-1002}
}
  • A. Rosenberg, Yu Zhang, Zelin Wu
  • Published 25 September 2019
  • Computer Science, Physics
  • 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU)
Recent success of the Tacotron speech synthesis architecture and its variants in producing natural-sounding multi-speaker synthesized speech has raised the exciting possibility of replacing expensive, manually transcribed, domain-specific human speech used to train speech recognizers. The multi-speaker speech synthesis architecture can learn latent embedding spaces of prosody, speaker, and style variations derived from input acoustic representations, thereby allowing for manipulation of… 
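
As a rough illustration of the data-augmentation setup the abstract describes, the Python sketch below mixes a transcribed corpus with utterances synthesized from unspoken text. The synthesize stub, file names, and corpus contents are placeholders, not the paper's actual multi-speaker Tacotron pipeline.

import random

def synthesize(text, speaker_id):
    # Placeholder for a multi-speaker, Tacotron-style TTS model (hypothetical);
    # returns one second of silence as a stand-in waveform.
    return [0.0] * 16000

real_corpus = [("audio_0001.wav", "hello world")]            # transcribed human speech
unspoken_text = ["new domain sentence one", "sentence two"]  # text with no audio

synthetic_corpus = [(synthesize(t, speaker_id=random.randrange(100)), t)
                    for t in unspoken_text]

# Mix real and synthesized utterances so the recognizer sees both in every epoch.
training_set = real_corpus + synthetic_corpus
random.shuffle(training_set)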

Citations

Improving Speech Recognition Using Consistent Predictions on Synthesized Speech

  • Gary Wang, A. Rosenberg, P. Moreno
  • Computer Science
    ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2020
It is demonstrated that promoting consistent predictions in response to real and synthesized speech significantly improves speech recognition performance, suggesting that with this approach reliance on transcribed audio can be cut nearly in half.
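
A minimal sketch of what a consistency objective between real and synthesized speech could look like; the symmetric KL form and the tensor shapes are assumptions for illustration, not necessarily the loss used in the cited ICASSP 2020 paper.

import torch
import torch.nn.functional as F

def consistency_loss(logits_real, logits_synth):
    # Symmetric KL divergence between the recognizer's output distributions for a
    # real utterance and a synthesized utterance of the same transcript.
    # Shapes are assumed to be [frames, vocab].
    p = F.log_softmax(logits_real, dim=-1)
    q = F.log_softmax(logits_synth, dim=-1)
    return 0.5 * (F.kl_div(q, p, reduction="batchmean", log_target=True)
                  + F.kl_div(p, q, reduction="batchmean", log_target=True))

loss = consistency_loss(torch.randn(50, 32), torch.randn(50, 32))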

Improving Speech Recognition Using GAN-Based Speech Synthesis and Contrastive Unspoken Text Selection

This work proposes to combine generative adversarial network (GAN) and multi-style training (MTR) to increase acoustic diversity in the synthesized data and presents a contrastive language model-based data selection technique to improve the efficiency of learning from unspoken text.
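
A hedged sketch of contrastive language-model-based text selection as summarized above: score each candidate sentence by the gap between in-domain and background LM log-probabilities and keep the best ones. The callables and the toy stand-in LMs below are hypothetical.

def select_unspoken_text(sentences, in_domain_lm, background_lm, top_k):
    # Contrastive selection: rank candidate sentences by the difference between
    # in-domain and background language-model log-probabilities, keep the top_k.
    # Both *_lm arguments are assumed to be callables returning log P(sentence).
    scored = sorted(((in_domain_lm(s) - background_lm(s), s) for s in sentences),
                    reverse=True)
    return [s for _, s in scored[:top_k]]

# Toy stand-in "language models" for illustration only.
in_domain = lambda s: -0.1 * len(s)
background = lambda s: -0.2 * len(s)
print(select_unspoken_text(["play some jazz music", "thanks"], in_domain, background, 1))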

Articulatory Synthesis for Data Augmentation in Phoneme Recognition

It is shown that the additional synthetic data can lead to a significantly better performance in single-phoneme recognition in certain cases, while at the same time, the performance can also decrease in other cases, depending on the degree of acoustic naturalness of the synthetic phonemes.

Semi-supervised ASR based on Iterative Joint Training with Discrete Speech Synthesis

The proposed Iterative Joint Training with discrete speech synthesis for semi-supervised ASR successfully improved recognition performance by using discrete speech representations instead of conventional acoustic features in IJT experiments with a single-speaker speech corpus.

Synthesis Speech Based Data Augmentation for Low Resource Children ASR

The results show that the augmentation improves system performance over the baseline, and the data scarcity issue of the low-resource language Punjabi is addressed through two levels of augmentation.

Speech Synthesis as Augmentation for Low-Resource ASR

This paper investigates the possibility of using synthesized speech as a form of data augmentation to lower the resources necessary to build a speech recognizer.

Extending Parrotron: An End-to-End, Speech Conversion and Speech Recognition Model for Atypical Speech

An extended Parrotron model is presented: a single, end-to-end network that enables voice conversion and recognition simultaneously, and how these methods generalize across 8 types of atypical speech for a range of speech impairment severities is shown.

Using Speech Synthesis to Train End-To-End Spoken Language Understanding Models

This work proposes a strategy to overcome this requirement in which speech synthesis is used to generate a large synthetic training dataset from several artificial speakers, and confirms the effectiveness of this approach with experiments on two open-source SLU datasets.

Machine Speech Chain

To the best of the authors' knowledge, this is the first deep learning framework that integrates human speech perception and production behaviors; it significantly improved performance over separate systems trained only with labeled data.

SynAug: Synthesis-Based Data Augmentation for Text-Dependent Speaker Verification

Experimental results show that for i-vector framework, the proposed methods can boost the system performance significantly, especially for the low-resource condition where the amount of genuine speech is extremely limited.
...

References

Showing 1-10 of 20 references

Training Neural Speech Recognition Systems with Synthetic Speech Augmentation

Very large end-to-end neural speech recognition models are trained using the LibriSpeech dataset augmented with synthetic speech, achieving state-of-the-art word error rates for character-level models without an external language model.

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
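
One simple way to realize "randomly sampled speaker embeddings" is to draw a vector on the unit hypersphere, mirroring how d-vectors are L2-normalized; this is an illustrative assumption rather than the paper's actual sampling procedure.

import numpy as np

def sample_speaker_embedding(dim=256):
    # Draw a random unit-norm vector as a "novel speaker" embedding; a multi-speaker
    # TTS model would then be conditioned on this vector to synthesize an unseen voice.
    v = np.random.randn(dim)
    return v / np.linalg.norm(v)

novel_speaker = sample_speaker_embedding()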

Listening while speaking: Speech chain by deep learning

This work develops the first deep learning model that integrates human speech perception and production behaviors, and shows that the proposed approach significantly improves performance over separate systems trained only with labeled data.

Machine Speech Chain with One-shot Speaker Adaptation

This paper presents a new speech chain mechanism by integrating a speaker recognition model inside the loop and proposes extending the capability of TTS to handle unseen speakers by implementing one-shot speaker adaptation.

Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.

State-of-the-Art Speech Recognition with Sequence-to-Sequence Models

A variety of structural and optimization improvements to the Listen, Attend, and Spell model are explored, which significantly improve performance, and a multi-head attention architecture is introduced, which offers improvements over the commonly used single-head attention.

Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis

"global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system, learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.

Data Augmentation for Deep Neural Network Acoustic Modeling

Two data augmentation approaches based on label-preserving transformations, vocal tract length perturbation (VTLP) and stochastic feature mapping (SFM), are investigated for deep neural network acoustic modeling to deal with data sparsity.
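
A minimal numpy sketch of frequency warping in the spirit of VTLP: stretch or compress the frequency axis of a spectrogram by a random factor. Actual VTLP applies a piecewise-linear warp to filterbank center frequencies, so this uniform warp is only an approximation for illustration.

import numpy as np

def vtlp_warp(spectrogram, alpha):
    # Stretch/compress the frequency axis of a [frames, bins] spectrogram by a
    # factor alpha (e.g. drawn from 0.9-1.1), using linear interpolation and
    # keeping the number of bins fixed.
    n_bins = spectrogram.shape[1]
    src = np.clip(np.arange(n_bins) / alpha, 0, n_bins - 1)
    lo = np.floor(src).astype(int)
    hi = np.minimum(lo + 1, n_bins - 1)
    frac = src - lo
    return (1 - frac) * spectrogram[:, lo] + frac * spectrogram[:, hi]

warped = vtlp_warp(np.random.rand(100, 80), alpha=np.random.uniform(0.9, 1.1))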

Hierarchical Generative Modeling for Controllable Speech Synthesis

A high-quality controllable TTS model is proposed that can control latent attributes of the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions.

Librispeech: An ASR corpus based on public domain audio books

It is shown that acoustic models trained on LibriSpeech give lower error rates on the Wall Street Journal (WSJ) test sets than models trained on WSJ itself.