• Publications
Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
TLDR
"Global style tokens" (GSTs), a bank of embeddings jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system, learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech
TLDR
Experimental results show that neural end-to-end TTS models trained on the LibriTTS corpus achieved mean opinion scores above 4.0 for naturalness on five of the six evaluation speakers.
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
TLDR
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
TLDR
A novel system that separates the voice of a target speaker from multi-speaker signals by using a reference signal from the target speaker; it is built by training two separate neural networks.
Hierarchical Generative Modeling for Controllable Speech Synthesis
TLDR
A high-quality controllable TTS model is proposed that can control latent attributes in the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions.
Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning
TLDR
A multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high-quality speech in multiple languages and to transfer voices across languages, e.g. English and Mandarin.
Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling
TLDR
This document outlines the underlying design of Lingvo and serves as an introduction to the various pieces of the framework, while also offering examples of advanced features that showcase its capabilities.
Leveraging Weakly Supervised Data to Improve End-to-end Speech-to-text Translation
TLDR
It is demonstrated that a high quality end-to-end ST model can be trained using only weakly supervised datasets, and that synthetic data sourced from unlabeled monolingual text or speech can be used to improve performance.
Improved Noisy Student Training for Automatic Speech Recognition
TLDR
This work adapts and improves noisy student training for automatic speech recognition, employing (adaptive) SpecAugment as the augmentation method and finding effective methods to filter, balance and augment the data generated between self-training iterations.
Direct speech-to-speech translation with a sequence-to-sequence model
We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text
...