• Publications
  • Influence
Natural TTS Synthesis by Conditioning Wavenet on MEL Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps
SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
TLDR
This work presents SpecAugment, a simple data augmentation method for speech recognition that is applied directly to the feature inputs of a neural network (i.e., filter bank coefficients) and achieves state-of-the-art performance on the LibriSpeech 960h and Swichboard 300h tasks, outperforming all prior work.
Conformer: Convolution-augmented Transformer for Speech Recognition
TLDR
This work proposes the convolution-augmented transformer for speech recognition, named Conformer, which significantly outperforms the previous Transformer and CNN based models achieving state-of-the-art accuracies.
LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech
TLDR
Experimental results show that neural end-to-end TTS models trained from the LibriTTS corpus achieved above 4.0 in mean opinion scores in naturalness in five out of six evaluation speakers.
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
TLDR
"global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system, learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
TLDR
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
WaveGrad: Estimating Gradients for Waveform Generation
TLDR
WaveGrad offers a natural way to trade inference speed for sample quality by adjusting the number of refinement steps, and bridges the gap between non-autoregressive and autoregressive models in terms of audio quality.
Training RNNs as Fast as CNNs
TLDR
The Simple Recurrent Unit architecture is proposed, a recurrent unit that simplifies the computation and exposes more parallelism, and is as fast as a convolutional layer and 5-10x faster than an optimized LSTM implementation.
Hierarchical Generative Modeling for Controllable Speech Synthesis
TLDR
A high-quality controllable TTS model which can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions is proposed.
Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning
TLDR
A multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high quality speech in multiple languages and be able to transfer voices across languages, e.g. English and Mandarin.
...
...