Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis
"global style tokens" (GSTs), a bank of embeddings that are jointly trained within Tacotron, a state-of-the-art end-to-end speech synthesis system, learn to factorize noise and speaker identity, providing a path towards highly scalable but robust speech synthesis.
LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech
Experimental results show that neural end-to-end TTS models trained on the LibriTTS corpus achieved mean opinion scores above 4.0 for naturalness for five out of six evaluation speakers.
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high-quality speaker representation.
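The transfer-learning idea here is to condition a TTS model on a fixed speaker embedding (a d-vector from a separately trained speaker-verification model). A common way to do this, sketched below with illustrative shapes rather than the paper's exact architecture, is to broadcast the embedding across time and concatenate it to every encoder state before decoding.

```python
import numpy as np

def condition_on_speaker(encoder_outputs, speaker_embedding):
    """Multispeaker conditioning sketch: tile a fixed speaker embedding
    (e.g. a d-vector) along the time axis and concatenate it to each
    encoder timestep, so the decoder sees speaker identity everywhere."""
    T = encoder_outputs.shape[0]
    tiled = np.tile(speaker_embedding, (T, 1))              # (T, spk_dim)
    return np.concatenate([encoder_outputs, tiled], axis=-1)

enc = np.zeros((50, 512))     # 50 encoder frames, 512-dim states (illustrative)
dvec = np.ones(256)           # 256-dim speaker embedding (illustrative)
cond = condition_on_speaker(enc, dvec)    # (50, 768) conditioned states
```

Because the embedding space is continuous, sampling a random point in it yields a plausible "new" voice, which is the novel-speaker result the summary refers to.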
VoiceFilter: Targeted Voice Separation by Speaker-Conditioned Spectrogram Masking
A novel system is proposed that separates the voice of a target speaker from multi-speaker signals by making use of a reference signal from the target speaker; it is built by training two separate neural networks.
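The two-network structure can be sketched as follows, with both networks replaced by stand-ins (none of these names are the paper's API): a speaker encoder turns reference audio into a d-vector, and a masking network takes the mixture spectrogram plus that d-vector and predicts a soft mask in [0, 1] that is applied elementwise to isolate the target speaker.

```python
import numpy as np

def separate_target(mixture_spec, mask):
    """VoiceFilter-style separation sketch: a soft ratio mask (predicted
    elsewhere from the mixture and the target speaker's d-vector) is
    applied elementwise to the mixture magnitude spectrogram, keeping
    the target speaker's energy and attenuating the rest."""
    return mixture_spec * mask

rng = np.random.default_rng(0)
mixture = np.abs(rng.normal(size=(100, 257)))   # magnitude spectrogram (T, F)
mask = rng.uniform(0.0, 1.0, size=(100, 257))   # would come from the masking net
target = separate_target(mixture, mask)
```

Since the mask lies in [0, 1], the separated spectrogram is pointwise no larger than the mixture, which is what makes the ratio-mask formulation stable to train.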
Hierarchical Generative Modeling for Controllable Speech Synthesis
A high-quality controllable TTS model is proposed that can control latent attributes of the generated speech that are rarely annotated in the training data, such as speaking style, accent, background noise, and recording conditions.
Learning to Speak Fluently in a Foreign Language: Multilingual Speech Synthesis and Cross-Language Voice Cloning
A multispeaker, multilingual text-to-speech (TTS) synthesis model based on Tacotron that is able to produce high-quality speech in multiple languages and to transfer voices across languages, e.g. between English and Mandarin.
Lingvo: a Modular and Scalable Framework for Sequence-to-Sequence Modeling
This document outlines the underlying design of Lingvo and serves as an introduction to the various pieces of the framework, while also offering examples of advanced features that showcase its capabilities.
Leveraging Weakly Supervised Data to Improve End-to-end Speech-to-text Translation
- Ye Jia, Melvin Johnson, Yonghui Wu
- Computer Science · ICASSP - IEEE International Conference on…
- 5 November 2018
It is demonstrated that a high-quality end-to-end ST model can be trained using only weakly supervised datasets, and that synthetic data sourced from unlabeled monolingual text or speech can be used to improve performance.
Improved Noisy Student Training for Automatic Speech Recognition
This work adapts and improves noisy student training for automatic speech recognition, employing (adaptive) SpecAugment as the augmentation method and finding effective methods to filter, balance, and augment the data generated between self-training iterations.
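One self-training round of the filter/augment loop can be sketched like this. All names are illustrative stand-ins, not the paper's actual pipeline: the teacher transcribes unlabeled audio, low-confidence pseudo-labels are filtered out, surviving examples are augmented, and the union with the labeled set trains the next student. Dataset balancing is omitted for brevity.

```python
def augment(audio):
    # stand-in for SpecAugment-style augmentation (identity here)
    return audio

def noisy_student_round(teacher, labeled, unlabeled, min_conf=0.8):
    """One noisy-student round: pseudo-label unlabeled audio with the
    teacher, keep only confident hypotheses, augment them, and return
    the combined training set for the next student model."""
    pseudo = []
    for audio in unlabeled:
        transcript, confidence = teacher(audio)
        if confidence >= min_conf:                    # filtering step
            pseudo.append((augment(audio), transcript))
    return labeled + pseudo

# toy demo: a "teacher" that is confident only on even-numbered clips
def toy_teacher(audio):
    return "hello", (0.9 if audio % 2 == 0 else 0.5)

training_set = noisy_student_round(toy_teacher, [(0, "hi")], [2, 3, 4])
```

In the full recipe this round is iterated: the trained student becomes the next teacher, and noise (augmentation) on the student side is what drives the improvement over plain self-training.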
Direct speech-to-speech translation with a sequence-to-sequence model
We present an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text…