• Corpus ID: 21010143

Deep Voice 2: Multi-Speaker Neural Text-to-Speech

@inproceedings{Gibiansky2017DeepV2,
  title={Deep Voice 2: Multi-Speaker Neural Text-to-Speech},
  author={Andrew Gibiansky and Sercan {\"O}. Arik and Gregory Frederick Diamos and John Miller and Kainan Peng and Wei Ping and Jonathan Raiman and Yanqi Zhou},
  booktitle={NIPS},
  year={2017}
}
We introduce a technique for augmenting neural text-to-speech (TTS) with low-dimensional trainable speaker embeddings to generate different voices from a single model. […] Key Method: We introduce Deep Voice 2, which is based on a pipeline similar to Deep Voice 1 but constructed with higher-performance building blocks, and demonstrates a significant audio-quality improvement over Deep Voice 1.
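The core mechanism is simple enough to sketch. Below is a minimal, hypothetical PyTorch illustration of conditioning a shared model on a trainable low-dimensional speaker embedding; the module names and dimensions are illustrative, not Deep Voice 2's actual architecture.

```python
import torch
import torch.nn as nn

class MultiSpeakerBlock(nn.Module):
    """Toy illustration: a shared layer conditioned on a trainable
    low-dimensional speaker embedding (names and sizes are hypothetical,
    not Deep Voice 2's actual modules)."""

    def __init__(self, hidden_dim: int, n_speakers: int, speaker_dim: int = 16):
        super().__init__()
        self.speaker_table = nn.Embedding(n_speakers, speaker_dim)
        self.proj = nn.Linear(speaker_dim, hidden_dim)  # per-site projection
        self.layer = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, x: torch.Tensor, speaker_id: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); speaker_id: (batch,)
        s = self.proj(self.speaker_table(speaker_id))  # (batch, hidden_dim)
        x = x + s.unsqueeze(1)                         # broadcast over time
        out, _ = self.layer(x)
        return out

block = MultiSpeakerBlock(hidden_dim=64, n_speakers=10)
x = torch.randn(2, 50, 64)
ids = torch.tensor([0, 7])
print(block(x, ids).shape)  # torch.Size([2, 50, 64])
```

Because the embedding table is trained jointly with the rest of the network, each speaker's vector ends up encoding whatever voice characteristics the shared layers need to differentiate speakers.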
Neural Text-to-Speech Adaptation from Low Quality Public Recordings
TLDR
This work introduces meta-learning to adapt the neural TTS front-end and shows that, for low-quality public recordings, adaptation based on the multi-speaker corpus can generate a cleaner target voice than a speaker-dependent model.
Multi-Lingual Multi-Speaker Text-to-Speech Synthesis for Voice Cloning with Online Speaker Enrollment
TLDR
This work investigates a novel approach for generating high-quality speech in multiple languages from speakers enrolled in their native language, introducing tone/stress embeddings that extend the language embedding to represent tone and stress information.
Deep Voice 3: 2000-Speaker Neural Text-to-Speech
TLDR
Deep Voice 3 is presented, a fully-convolutional attention-based neural text-to-speech (TTS) system that matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster.
Adapting TTS models For New Speakers using Transfer Learning
TLDR
It is found that fine-tuning a single-speaker TTS model on just 30 minutes of data can yield performance comparable to a model trained from scratch on more than 27 hours of data, for both male and female target speakers.
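As a rough illustration of this recipe, the sketch below continues training a stand-in model at a reduced learning rate; the checkpoint path, model, and data are placeholders, not the paper's actual setup.

```python
import torch
import torch.nn as nn

# Minimal sketch of the transfer-learning recipe (checkpoint path, model,
# and data are stand-ins): initialize from a single-speaker model's weights,
# then fine-tune everything on the small target-speaker set at a lower LR.
model = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 80))
# model.load_state_dict(torch.load("single_speaker_tts.pt"))  # pretrained init

opt = torch.optim.Adam(model.parameters(), lr=1e-4)  # lower than from-scratch
loader = [(torch.randn(16, 100), torch.randn(16, 80))  # ~30 min of paired
          for _ in range(20)]                          # data, faked here
for feats, mel in loader:
    opt.zero_grad()
    loss = nn.functional.l1_loss(model(feats), mel)
    loss.backward()
    opt.step()
```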
Voice Cloning: a Multi-Speaker Text-to-Speech Synthesis Approach based on Transfer Learning
TLDR
The proposed approach aims to obtain a system that models a multi-speaker acoustic space and can generate speech resembling the voices of different target speakers, even speakers not observed during training.
Cross-lingual Multi-speaker Text-to-speech Synthesis for Voice Cloning without Using Parallel Corpus for Unseen Speakers
TLDR
It is found that normalizing speaker-embedding x-vectors by L2 normalization or whitening substantially improves output quality in many cases, and that the WaveNet vocoder appears to be language-independent: a WaveNet trained on Cantonese speech generates Mandarin and English speech very well.
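Both normalizations are a few lines of NumPy. The sketch below shows L2 normalization and ZCA whitening of a batch of embeddings; the dimensions are illustrative (real x-vectors are typically ~512-dimensional).

```python
import numpy as np

def l2_normalize(x_vectors: np.ndarray) -> np.ndarray:
    """Project each embedding onto the unit sphere."""
    norms = np.linalg.norm(x_vectors, axis=1, keepdims=True)
    return x_vectors / np.maximum(norms, 1e-8)

def whiten(x_vectors: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """ZCA-whiten embeddings: zero mean, ~identity covariance."""
    mu = x_vectors.mean(axis=0, keepdims=True)
    xc = x_vectors - mu
    cov = xc.T @ xc / (len(xc) - 1)
    vals, vecs = np.linalg.eigh(cov)
    zca = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return xc @ zca

emb = np.random.randn(200, 16)  # stand-in embeddings
print(np.linalg.norm(l2_normalize(emb), axis=1)[:3])  # ~1.0 each
print(np.allclose(np.cov(whiten(emb), rowvar=False),
                  np.eye(16), atol=1e-2))             # ~identity covariance
```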
Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech
TLDR
Model-Agnostic Meta-Learning (MAML) is used as the training algorithm of a multi-speaker TTS model, aiming to find a good meta-initialization from which the model can quickly adapt to any few-shot speaker-adaptation task; it outperforms the speaker-encoding baseline under the same training scheme.
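For readers unfamiliar with MAML, the sketch below shows its two-loop structure on a toy regression stand-in, using the first-order approximation for brevity; the actual method meta-trains a full multi-speaker TTS model, with each task being one speaker's data.

```python
import copy
import torch
import torch.nn as nn

# Toy stand-in for a multi-speaker TTS model; MAML's structure is the point.
model = nn.Sequential(nn.Linear(8, 32), nn.ReLU(), nn.Linear(32, 8))
meta_opt = torch.optim.Adam(model.parameters(), lr=1e-3)
inner_lr, inner_steps = 0.01, 3

def task_batch():
    """Hypothetical per-speaker task: (support, query) input/target pairs."""
    x = torch.randn(16, 8)
    return (x[:8], x[:8] * 2), (x[8:], x[8:] * 2)

for meta_step in range(100):
    meta_opt.zero_grad()
    for _ in range(4):  # tasks (speakers) per meta-batch
        (xs, ys), (xq, yq) = task_batch()
        fast = copy.deepcopy(model)          # clone for the inner loop
        for _ in range(inner_steps):         # adapt on the support set
            loss = nn.functional.mse_loss(fast(xs), ys)
            grads = torch.autograd.grad(loss, fast.parameters())
            with torch.no_grad():
                for p, g in zip(fast.parameters(), grads):
                    p -= inner_lr * g
        # First-order MAML: query-set gradients of the adapted clone are
        # accumulated directly into the meta-parameters.
        q_loss = nn.functional.mse_loss(fast(xq), yq)
        q_grads = torch.autograd.grad(q_loss, fast.parameters())
        for p, g in zip(model.parameters(), q_grads):
            p.grad = g if p.grad is None else p.grad + g
    meta_opt.step()
```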
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
TLDR
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voices of novel speakers dissimilar from those used in training, indicating that the model has learned a high-quality speaker representation.
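Since such speaker-verification embeddings are typically L2-normalized, sampling a "novel speaker" amounts to drawing a random point on the unit hypersphere, as in this small sketch (the dimension is illustrative).

```python
import torch

def random_speaker_embedding(dim: int = 256) -> torch.Tensor:
    """Sample a novel 'speaker' uniformly on the unit hypersphere, the
    space where L2-normalized verification embeddings live."""
    v = torch.randn(dim)
    return v / v.norm()

# Hypothetical use: condition the synthesizer on the sampled embedding
# instead of one extracted from reference audio of a real speaker.
emb = random_speaker_embedding()
print(emb.shape, emb.norm())  # torch.Size([256]) tensor(1.)
```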
GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis
TLDR
This work proposes GANSpeech, a high-fidelity multi-speaker TTS model that applies adversarial training to a non-autoregressive multi-speaker TTS model, together with a simple but effective automatic scaling method for the feature-matching loss used in adversarial training.
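The exact scaling rule is the paper's contribution; a plausible sketch of the idea, assuming the feature-matching term is rescaled so its magnitude tracks the reconstruction (mel) loss, looks like this.

```python
import torch

def scaled_feature_matching_loss(real_feats, fake_feats, recon_loss):
    """Hypothetical sketch: L1 feature matching over discriminator layers,
    rescaled by a detached ratio so its magnitude tracks the recon loss."""
    fm = torch.stack([(r.detach() - f).abs().mean()
                      for r, f in zip(real_feats, fake_feats)]).sum()
    scale = (recon_loss / fm).detach()  # keep the ratio out of the graph
    # The value equals recon_loss by construction; what matters is that
    # gradients flow through fm at an automatically chosen magnitude.
    return scale * fm

real_feats = [torch.randn(4, 16), torch.randn(4, 32)]
fake_feats = [torch.randn(4, 16, requires_grad=True),
              torch.randn(4, 32, requires_grad=True)]
print(scaled_feature_matching_loss(real_feats, fake_feats, torch.tensor(0.7)))
```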
Textless Speech-to-Speech Translation on Real Data
TLDR
This work is the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs, enabled by a self-supervised unit-based speech normalization technique.

References

SHOWING 1-10 OF 25 REFERENCES
Deep Voice: Real-time Neural Text-to-Speech
TLDR
Deep Voice lays the groundwork for truly end-to-end neural speech synthesis and shows that inference with the system can be performed faster than real time and describes optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.
Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis
TLDR
This paper proposes an approach to multi-speaker TTS modeling with a single DNN, where the hidden layers are shared among speakers while the output layers are composed of speaker-dependent nodes targeting each speaker.
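The architecture is easy to picture: one shared trunk, one output head per speaker. A minimal PyTorch sketch (all dimensions illustrative):

```python
import torch
import torch.nn as nn

class SharedTrunkTTS(nn.Module):
    """Sketch of the idea: hidden layers shared across speakers,
    one speaker-dependent output head each (sizes are illustrative)."""

    def __init__(self, in_dim=100, hidden=256, out_dim=80, n_speakers=4):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleList(
            nn.Linear(hidden, out_dim) for _ in range(n_speakers)
        )

    def forward(self, linguistic_feats, speaker_id: int):
        h = self.trunk(linguistic_feats)   # shared representation
        return self.heads[speaker_id](h)   # speaker-dependent acoustics

model = SharedTrunkTTS()
print(model(torch.randn(32, 100), speaker_id=2).shape)  # torch.Size([32, 80])
```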
Deep Speaker: an End-to-End Neural Speaker Embedding System
TLDR
Results are presented suggesting that adapting from a model trained on Mandarin can improve accuracy for English speaker recognition, and that Deep Speaker outperforms a DNN-based i-vector baseline.
Tacotron: Towards End-to-End Speech Synthesis
TLDR
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Robust Speaker-Adaptive HMM-Based Text-to-Speech Synthesis
TLDR
A speaker-adaptive HMM-based speech synthesis system is described that employs speaker adaptation, feature-space adaptive training, mixed-gender modeling, and full-covariance modeling using CSMAPLR transforms, in addition to several other techniques that have proved effective in previous systems.
A study of speaker adaptation for DNN-based speech synthesis
TLDR
An experimental analysis of speaker adaptation for DNN-based speech synthesis is presented at different levels, systematically analysing the performance of each individual adaptation technique and of their combinations.
WaveNet: A Generative Model for Raw Audio
TLDR
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
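The building block behind WaveNet is the dilated causal convolution; stacking layers with exponentially growing dilation yields a receptive field that grows exponentially with depth. A simplified sketch (omitting skip connections and conditioning):

```python
import torch
import torch.nn as nn

class CausalDilatedConv(nn.Module):
    """One WaveNet-style layer: left-padded (causal) dilated convolution
    with a gated activation and residual connection (simplified sketch)."""

    def __init__(self, channels: int, dilation: int):
        super().__init__()
        self.pad = (2 - 1) * dilation  # kernel_size = 2
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size=2,
                              dilation=dilation)

    def forward(self, x):
        h = self.conv(nn.functional.pad(x, (self.pad, 0)))  # pad left only
        filt, gate = h.chunk(2, dim=1)
        return x + torch.tanh(filt) * torch.sigmoid(gate)   # residual

# Exponentially growing dilations -> large receptive field at low cost.
layers = nn.Sequential(*[CausalDilatedConv(32, 2 ** i) for i in range(8)])
audio = torch.randn(1, 32, 16000)  # (batch, channels, samples)
print(layers(audio).shape)         # length preserved: (1, 32, 16000)
```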
On the training of DNN-based average voice model for speech synthesis
  • Shan Yang, Zhizheng Wu, Lei Xie
  • Computer Science
    2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)
  • 2016
TLDR
This work performs a systematic analysis of the training of the multi-speaker average voice model (AVM), which is the foundation of the adaptability and controllability of a DNN-based speech synthesis system.
Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code
  • Ossama Abdel-Hamid, Hui Jiang
  • Computer Science, Physics
    2013 IEEE International Conference on Acoustics, Speech and Signal Processing
  • 2013
TLDR
A new fast speaker adaptation method is proposed for the hybrid NN/HMM speech recognition model that achieves over 10% relative reduction in phone error rate using only seven utterances for adaptation.
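The idea is that at adaptation time the network itself is frozen and only a small per-speaker code vector, fed as an extra input, is learned from the handful of utterances. A hypothetical sketch (all dimensions illustrative):

```python
import torch
import torch.nn as nn

# Sketch of speaker-code adaptation: the acoustic model is frozen; only a
# small per-speaker code appended to the input features is learned from a
# few adaptation utterances (dimensions are hypothetical).
acoustic_model = nn.Sequential(nn.Linear(40 + 8, 128), nn.ReLU(),
                               nn.Linear(128, 42))  # feats+code -> phone logits
for p in acoustic_model.parameters():
    p.requires_grad_(False)  # network weights stay fixed at adaptation time

speaker_code = nn.Parameter(torch.zeros(8))  # the only trainable quantity
opt = torch.optim.SGD([speaker_code], lr=0.1)

feats = torch.randn(100, 40)          # frames from a few adaptation utterances
labels = torch.randint(0, 42, (100,))
for _ in range(50):
    opt.zero_grad()
    inp = torch.cat([feats, speaker_code.expand(len(feats), -1)], dim=1)
    loss = nn.functional.cross_entropy(acoustic_model(inp), labels)
    loss.backward()
    opt.step()
```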
Voice Conversion from Unaligned Corpora Using Variational Autoencoding Wasserstein Generative Adversarial Networks
TLDR
This paper proposes a non-parallel VC framework with a variational autoencoding Wasserstein generative adversarial network (VAW-GAN) that explicitly considers a VC objective when building the speech model.