• Corpus ID: 21010143

Deep Voice 2: Multi-Speaker Neural Text-to-Speech

Authors: Andrew Gibiansky, Sercan Ö. Arik, Gregory Frederick Diamos, John Miller, Kainan Peng, Wei Ping, Jonathan Raiman, Yanqi Zhou
We introduce a technique for augmenting neural text-to-speech (TTS) with low-dimensional trainable speaker embeddings to generate different voices from a single model. [...] We introduce Deep Voice 2, which follows a pipeline similar to that of Deep Voice 1 but is built from higher-performance building blocks and demonstrates a significant audio quality improvement over Deep Voice 1.
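As a rough illustration of the speaker-embedding idea in the summary above, the following minimal sketch keeps one trainable vector per speaker and feeds it into a layer whose weights are shared across all speakers. It is not the Deep Voice 2 implementation: the sizes, the concatenation scheme, and the names `speaker_table` and `condition_on_speaker` are assumptions made for this example only.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are hypothetical, chosen only for illustration.
NUM_SPEAKERS = 4   # number of voices the single model should cover
EMBED_DIM = 16     # "low-dimensional" speaker vector
FEATURE_DIM = 8    # per-frame text/phoneme feature size
OUTPUT_DIM = 5     # per-frame acoustic feature size

# One vector per speaker. In training these would be updated by
# backpropagation together with the rest of the model; here they are
# just randomly initialised.
speaker_table = rng.normal(scale=0.1, size=(NUM_SPEAKERS, EMBED_DIM))

# Weights shared by every speaker.
W = rng.normal(scale=0.1, size=(FEATURE_DIM + EMBED_DIM, OUTPUT_DIM))
b = np.zeros(OUTPUT_DIM)


def condition_on_speaker(frames, speaker_id):
    """Append the speaker's embedding to every frame, then apply the shared layer."""
    embedding = speaker_table[speaker_id]                         # (EMBED_DIM,)
    tiled = np.broadcast_to(embedding, (frames.shape[0], EMBED_DIM))
    conditioned = np.concatenate([frames, tiled], axis=1)         # (T, FEATURE_DIM + EMBED_DIM)
    return np.tanh(conditioned @ W + b)                           # (T, OUTPUT_DIM)


# The same input frames and the same shared weights give different
# outputs depending only on which speaker vector is looked up.
frames = rng.normal(size=(10, FEATURE_DIM))
out_a = condition_on_speaker(frames, speaker_id=0)
out_b = condition_on_speaker(frames, speaker_id=1)
```

In this sketch the speaker identity enters the model only through the looked-up vector, so adding a voice means adding a row to the table rather than training a separate model.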
Multi-Lingual Multi-Speaker Text-to-Speech Synthesis for Voice Cloning with Online Speaker Enrollment
This work investigates a novel approach for generating high-quality speech in multiple languages for speakers enrolled in their native language, by introducing tone/stress embeddings that extend the language embedding to represent tone and stress information.
Deep Voice 3: 2000-Speaker Neural Text-to-Speech
Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system, is presented; it matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster.
Adapting TTS models For New Speakers using Transfer Learning
It is found that fine-tuning a single-speaker TTS model on just 30 minutes of data can yield performance comparable to that of a model trained from scratch on more than 27 hours of data, for both male and female target speakers.
Voice Cloning: a Multi-Speaker Text-to-Speech Synthesis Approach based on Transfer Learning
The proposed approach aims to overcome these limitations by obtaining a system that models a multi-speaker acoustic space and can generate speech audio similar to the voices of different target speakers, even if they were not observed during the training phase.
Low-Resource Expressive Text-To-Speech Using Data Augmentation
This work presents a novel 3-step methodology to circumvent the costly operation of recording large amounts of target data in order to build expressive style voices with as little as 15 minutes of such recordings.
Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech
Model Agnostic Meta-Learning (MAML) is used as the training algorithm of a multi-speaker TTS model; it aims to find a good meta-initialization from which the model can adapt quickly to any few-shot speaker adaptation task, and it outperforms the speaker encoding baseline under the same training scheme.
Textless Speech-to-Speech Translation on Real Data
To the authors' knowledge, this work is the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs. It finetunes a pre-trained speech encoder with paired audio from multiple speakers and a single reference speaker to reduce the variations due to accents.
GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis
This work proposes GANSpeech, a high-fidelity multi-speaker TTS model that applies adversarial training to a non-autoregressive multi-speaker TTS model, and proposes simple but efficient automatic scaling methods for the feature matching loss used in adversarial training.
Multi Speaker Speech Synthesis System for Indonesian Language
  • M. J. Budiman, D. Lestari
  • Computer Science
    2020 7th International Conference on Advance Informatics: Concepts, Theory and Applications (ICAICTA)
  • 2020
A multi-speaker speech synthesis system is built for the Indonesian language using the Deep Voice 3 architecture, with several additional components for preprocessing and post-processing, and is evaluated subjectively to assess the naturalness, similarity to the original speaker, and intelligibility of the produced speech.
Multi-speaker Sequence-to-sequence Speech Synthesis for Data Augmentation in Acoustic-to-word Speech Recognition
This work extends the speech synthesizer so that it can output speech of many speakers and demonstrates that the A2W model trained with the multi-speaker model achieved a significant improvement over the baseline and the single speaker model.


Deep Voice: Real-time Neural Text-to-Speech
Deep Voice lays the groundwork for truly end-to-end neural speech synthesis. The work shows that inference with the system can be performed faster than real time, and describes optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.
Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis
This paper proposes an approach to modeling multi-speaker TTS with a general DNN, where the same hidden layers are shared among different speakers while the output layers are composed of speaker-dependent nodes explaining the target of each speaker.
Deep Speaker: an End-to-End Neural Speaker Embedding System
Results that suggest adapting from a model trained with Mandarin can improve accuracy for English speaker recognition are presented, and it is suggested that Deep Speaker outperforms a DNN-based i-vector baseline.
Tacotron: Towards End-to-End Speech Synthesis
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
A study of speaker adaptation for DNN-based speech synthesis
This paper presents an experimental analysis of speaker adaptation for DNN-based speech synthesis at different levels, systematically analysing the performance of each individual adaptation technique and that of their combinations.
WaveNet: A Generative Model for Raw Audio
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code
  • Ossama Abdel-Hamid, Hui Jiang
  • Computer Science, Physics
    2013 IEEE International Conference on Acoustics, Speech and Signal Processing
  • 2013
A new fast speaker adaptation method for the hybrid NN-HMM speech recognition model is presented that can achieve over 10% relative reduction in phone error rate by using only seven utterances for adaptation.
Voice Conversion from Unaligned Corpora Using Variational Autoencoding Wasserstein Generative Adversarial Networks
This paper proposes a non-parallel VC framework with a variational autoencoding Wasserstein generative adversarial network (VAW-GAN) that explicitly considers a VC objective when building the speech model.
Char2Wav: End-to-End Speech Synthesis
Char2Wav is an end-to-end model for speech synthesis that learns to produce audio directly from text and is a bidirectional recurrent neural network with attention that produces vocoder acoustic features.
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model
It is shown that the model, which profits from combining memory-less modules (namely autoregressive multilayer perceptrons) and stateful recurrent neural networks in a hierarchical structure, is able to capture underlying sources of variation in temporal sequences over very long time spans, on three datasets of different nature.