Deep Voice 2: Multi-Speaker Neural Text-to-Speech
@inproceedings{Gibiansky2017DeepV2,
  title     = {Deep Voice 2: Multi-Speaker Neural Text-to-Speech},
  author    = {Andrew Gibiansky and Sercan {\"O}. Arik and Gregory Frederick Diamos and John Miller and Kainan Peng and Wei Ping and Jonathan Raiman and Yanqi Zhou},
  booktitle = {NIPS},
  year      = {2017}
}
We introduce a technique for augmenting neural text-to-speech (TTS) with low-dimensional trainable speaker embeddings to generate different voices from a single model. We introduce Deep Voice 2, which is based on a pipeline similar to that of Deep Voice 1 but constructed from higher-performance building blocks, and which demonstrates a significant audio quality improvement over Deep Voice 1.
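A minimal sketch of the core idea, in PyTorch: each speaker gets a low-dimensional trainable embedding that conditions a shared synthesis network, so a single model produces many voices. The module layout, dimensions, and names below are illustrative assumptions, not Deep Voice 2's actual architecture.

```python
# Sketch: a trainable low-dimensional speaker embedding conditions a shared
# TTS network. Layer choices and sizes are illustrative only.
import torch
import torch.nn as nn

class MultiSpeakerTTS(nn.Module):
    def __init__(self, n_speakers, speaker_dim=16, text_dim=256, hidden_dim=256, n_mels=80):
        super().__init__()
        # One low-dimensional trainable vector per speaker.
        self.speaker_emb = nn.Embedding(n_speakers, speaker_dim)
        self.encoder = nn.GRU(text_dim, hidden_dim, batch_first=True)
        # Project the speaker embedding and add it to every encoder frame.
        self.spk_proj = nn.Linear(speaker_dim, hidden_dim)
        self.decoder = nn.Linear(hidden_dim, n_mels)

    def forward(self, text_features, speaker_ids):
        h, _ = self.encoder(text_features)                 # (B, T, hidden_dim)
        s = self.spk_proj(self.speaker_emb(speaker_ids))   # (B, hidden_dim)
        h = h + s.unsqueeze(1)                             # broadcast over time
        return self.decoder(h)                             # predicted mel frames

model = MultiSpeakerTTS(n_speakers=108)
mels = model(torch.randn(2, 50, 256), torch.tensor([3, 41]))
```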
350 Citations
Neural Text-to-Speech Adaptation from Low Quality Public Recordings
- Computer Science · 10th ISCA Workshop on Speech Synthesis (SSW 10)
- 2019
This work introduces meta-learning to adapt the neural TTS front-end and shows that for low quality public recordings, the adaptation based on the multi-speaker corpus can generate a cleaner target voice in comparison with the speaker-dependent model.
Multi-Lingual Multi-Speaker Text-to-Speech Synthesis for Voice Cloning with Online Speaker Enrollment
- Computer Science · INTERSPEECH
- 2020
This work investigates a novel approach for generating high-quality speech in multiple languages from speakers enrolled in their native language, introducing tone/stress embeddings that extend the language embedding to represent tone and stress information.
Deep Voice 3: 2000-Speaker Neural Text-to-Speech
- Computer Science · ICLR 2018
- 2017
Deep Voice 3 is presented, a fully-convolutional attention-based neural text-to-speech (TTS) system that matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster.
Adapting TTS models For New Speakers using Transfer Learning
- Physics · ArXiv
- 2021
It is found that fine-tuning a single-speaker TTS model on just 30 minutes of data can yield performance comparable to a model trained from scratch on more than 27 hours of data, for both male and female target speakers.
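A hedged sketch of the fine-tuning recipe this result describes: load a pretrained checkpoint and continue training all weights on a small target-speaker set at a reduced learning rate. It reuses the hypothetical MultiSpeakerTTS class from the sketch above; the tensors here are dummy stand-ins for the ~30 minutes of target audio, and the checkpoint path is a placeholder.

```python
# Illustrative transfer-learning loop (PyTorch), not the paper's exact recipe.
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy stand-in for ~30 minutes of (text features, target mel) pairs.
target_data = TensorDataset(torch.randn(64, 50, 256), torch.randn(64, 50, 80))
loader = DataLoader(target_data, batch_size=8, shuffle=True)

model = MultiSpeakerTTS(n_speakers=1)  # reusing the sketch class above
# model.load_state_dict(torch.load("single_speaker.pt"))  # hypothetical checkpoint

optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)  # small LR for adaptation
loss_fn = torch.nn.L1Loss()

for text_features, target_mels in loader:
    speaker_ids = torch.zeros(text_features.size(0), dtype=torch.long)
    loss = loss_fn(model(text_features, speaker_ids), target_mels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```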
Voice Cloning: a Multi-Speaker Text-to-Speech Synthesis Approach based on Transfer Learning
- Computer Science, Physics · ArXiv
- 2021
The proposed approach aims to overcome these limitations by obtaining a system able to model a multi-speaker acoustic space, allowing the generation of speech audio similar to the voices of different target speakers, even speakers not observed during the training phase.
Cross-lingual Multi-speaker Text-to-speech Synthesis for Voice Cloning without Using Parallel Corpus for Unseen Speakers
- Linguistics, Physics
- 2019
It is found that normalizing speaker-embedding x-vectors by L2 normalization or whitening substantially improves output quality in many cases, and that WaveNet performance appears language-independent: a WaveNet trained on Cantonese speech can generate Mandarin and English speech very well.
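The two normalizations mentioned can be written in a few lines of NumPy; this is a generic sketch of L2 normalization and ZCA whitening applied to a matrix of x-vectors, not the authors' exact code.

```python
# Generic embedding normalizations: unit L2 norm, and ZCA whitening.
import numpy as np

def l2_normalize(x):
    """Scale each embedding (row) to unit L2 norm."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def whiten(x, eps=1e-5):
    """Decorrelate embedding dimensions and give them unit variance (ZCA whitening)."""
    mu = x.mean(axis=0)
    cov = np.cov(x - mu, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T  # ZCA whitening matrix
    return (x - mu) @ W

xvectors = np.random.randn(100, 512)  # e.g. 100 speakers, 512-dim x-vectors
print(l2_normalize(xvectors).shape, whiten(xvectors).shape)
```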
Meta-TTS: Meta-Learning for Few-Shot Speaker Adaptive Text-to-Speech
- Computer Science · IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2022
Model-Agnostic Meta-Learning (MAML) is used as the training algorithm of a multi-speaker TTS model, aiming to find a meta-initialization that adapts quickly to any few-shot speaker-adaptation task; it outperforms the speaker-encoding baseline under the same training scheme.
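A compact sketch of the MAML training scheme described above, applied to the hypothetical MultiSpeakerTTS model from earlier. For brevity this uses the first-order MAML approximation (the outer update ignores second-order terms); full MAML differentiates through the inner-loop updates. All data tensors are dummies.

```python
# First-order MAML sketch: inner loop adapts a clone to one speaker's support
# set; outer loop applies the adapted model's query gradients to the
# meta-initialization.
import copy
import torch

meta_model = MultiSpeakerTTS(n_speakers=1)  # reusing the sketch class above
meta_opt = torch.optim.Adam(meta_model.parameters(), lr=1e-4)
loss_fn = torch.nn.L1Loss()

def speaker_task():
    """Dummy (support, query) batches for one sampled speaker."""
    mk = lambda: (torch.randn(4, 50, 256), torch.randn(4, 50, 80))
    return mk(), mk()

for step in range(100):
    (xs, ys), (xq, yq) = speaker_task()
    sid = torch.zeros(4, dtype=torch.long)

    # Inner loop: a few gradient steps on the support set, on a clone.
    adapted = copy.deepcopy(meta_model)
    inner_opt = torch.optim.SGD(adapted.parameters(), lr=1e-3)
    for _ in range(5):
        inner_opt.zero_grad()
        loss_fn(adapted(xs, sid), ys).backward()
        inner_opt.step()

    # Outer loop (first-order): copy the query-loss gradients onto the
    # meta-parameters and step the meta-optimizer.
    adapted.zero_grad()
    loss_fn(adapted(xq, sid), yq).backward()
    meta_opt.zero_grad()
    for p_meta, p_adapt in zip(meta_model.parameters(), adapted.parameters()):
        p_meta.grad = p_adapt.grad.clone()
    meta_opt.step()
```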
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
- Computer Science · NeurIPS
- 2018
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high-quality speaker representation.
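Because the synthesizer in this system is conditioned on a speaker-verification embedding rather than a speaker ID, any plausible point in embedding space defines a voice. A small illustrative sketch: fit a diagonal Gaussian to training-speaker embeddings and sample from it. The paper demonstrates the effect of randomly sampled embeddings; this particular prior is an assumption.

```python
# Sampling a novel "speaker" from a simple prior over embedding space.
import torch

train_embs = torch.randn(500, 256)         # stand-in for training-speaker embeddings
mu, sigma = train_embs.mean(0), train_embs.std(0)

novel_emb = mu + sigma * torch.randn(256)  # a new, unseen voice
novel_emb = novel_emb / novel_emb.norm()   # unit norm, as speaker encoders often use
```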
GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis
- Computer Science, Physics · Interspeech
- 2021
This work proposes GANSpeech, a high-fidelity multi-speaker TTS model that applies adversarial training to a non-autoregressive multi-speaker TTS model, and proposes a simple but efficient automatic scaling method for the feature-matching loss used in adversarial training.
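A sketch of what a feature-matching loss with automatic scaling can look like: match discriminator intermediate features between real and generated mel-spectrograms, then rescale that term so its magnitude tracks the reconstruction loss. The exact scaling rule below is an illustrative assumption, not necessarily GANSpeech's formula.

```python
# Feature-matching loss over discriminator feature maps, with an automatic
# scale that ties its magnitude to the reconstruction loss.
import torch

def feature_matching_loss(real_feats, fake_feats):
    # L1 distance between feature maps, averaged over discriminator layers.
    return sum(torch.mean(torch.abs(r.detach() - f))
               for r, f in zip(real_feats, fake_feats)) / len(real_feats)

def scaled_fm_loss(fm_loss, recon_loss):
    # Weight the feature-matching term so it matches the reconstruction loss
    # in magnitude; the scale is detached so it acts as a constant.
    scale = (recon_loss / (fm_loss + 1e-8)).detach()
    return scale * fm_loss
```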
Textless Speech-to-Speech Translation on Real Data
- Computer Science · ArXiv
- 2021
This work is the first to establish a textless S2ST technique that can be trained with real-world data and works for multiple language pairs, enabled by a self-supervised unit-based speech normalization technique.
References
Deep Voice: Real-time Neural Text-to-Speech
- Computer Science · ICML
- 2017
Deep Voice lays the groundwork for truly end-to-end neural speech synthesis; it shows that inference with the system can be performed faster than real time, and describes optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.
Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis
- Computer Science · 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2015
This paper proposes an approach to multi-speaker TTS modeling with a general DNN, where the hidden layers are shared among speakers while the output layers consist of speaker-dependent nodes targeting each speaker.
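The shared-layer/speaker-dependent-output design is easy to picture in code; a minimal PyTorch sketch, with arbitrary layer sizes:

```python
# Shared hidden layers with one speaker-dependent output head per speaker.
import torch
import torch.nn as nn

class SharedTrunkDNN(nn.Module):
    def __init__(self, n_speakers, in_dim=256, hidden_dim=512, out_dim=80):
        super().__init__()
        self.shared = nn.Sequential(                 # shared among all speakers
            nn.Linear(in_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # One speaker-dependent output layer per speaker.
        self.heads = nn.ModuleList(nn.Linear(hidden_dim, out_dim) for _ in range(n_speakers))

    def forward(self, x, speaker_id):
        return self.heads[speaker_id](self.shared(x))

model = SharedTrunkDNN(n_speakers=4)
out = model(torch.randn(8, 256), speaker_id=2)   # acoustic features for speaker 2
```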
Deep Speaker: an End-to-End Neural Speaker Embedding System
- Computer Science, Physics · ArXiv
- 2017
Results suggest that adapting from a model trained on Mandarin can improve accuracy for English speaker recognition, and that Deep Speaker outperforms a DNN-based i-vector baseline.
Tacotron: Towards End-to-End Speech Synthesis
- Computer Science · INTERSPEECH
- 2017
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Robust Speaker-Adaptive HMM-Based Text-to-Speech Synthesis
- Computer Science · IEEE Transactions on Audio, Speech, and Language Processing
- 2009
Describes a speaker-adaptive HMM-based speech synthesis system that employs speaker adaptation, feature-space adaptive training, mixed-gender modeling, and full-covariance modeling using CSMAPLR transforms, in addition to several other techniques that have proved effective in previous systems.
A study of speaker adaptation for DNN-based speech synthesis
- Computer Science · INTERSPEECH
- 2015
This paper presents an experimental analysis of speaker adaptation for DNN-based speech synthesis at different levels, and systematically analyses the performance of each individual adaptation technique and of their combinations.
WaveNet: A Generative Model for Raw Audio
- Computer Science · SSW
- 2016
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
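WaveNet's key ingredient is a stack of dilated causal convolutions whose receptive field grows exponentially with depth. A stripped-down PyTorch sketch, omitting the gated activations, skip connections, and mu-law output of the real model:

```python
# Dilated causal 1-D convolution stack: receptive field doubles per layer.
import torch
import torch.nn as nn

class DilatedCausalStack(nn.Module):
    def __init__(self, channels=32, n_layers=8):
        super().__init__()
        self.layers = nn.ModuleList()
        for i in range(n_layers):
            d = 2 ** i                               # dilation: 1, 2, 4, ..., 128
            self.layers.append(nn.Conv1d(channels, channels, kernel_size=2, dilation=d))

    def forward(self, x):
        for conv in self.layers:
            d = conv.dilation[0]
            # Left-pad so the convolution is causal (no future samples used).
            h = conv(nn.functional.pad(x, (d, 0)))
            x = x + torch.relu(h)                    # simple residual connection
        return x

net = DilatedCausalStack()
y = net(torch.randn(1, 32, 16000))                   # one second of audio at 16 kHz
```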
On the training of DNN-based average voice model for speech synthesis
- Computer Science · 2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)
- 2016
This work performs a systematic analysis of the training of multi-speaker average voice model (AVM), which is the foundation of adaptability and controllability of a DNN-based speech synthesis system.
Fast speaker adaptation of hybrid NN/HMM model for speech recognition based on discriminative learning of speaker code
- Computer Science, Physics · 2013 IEEE International Conference on Acoustics, Speech and Signal Processing
- 2013
A new fast speaker adaptation method for the hybrid NN/HMM speech recognition model is proposed, achieving over 10% relative reduction in phone error rate using only seven utterances for adaptation.
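A minimal sketch of speaker-code adaptation: the network weights are frozen and only a small per-speaker code vector, fed as an extra input, is optimized on a few adaptation utterances. Shapes, the loss, and the toy model below are illustrative assumptions (the original is a hybrid NN/HMM acoustic model for ASR).

```python
# Adapt only a small speaker code; the acoustic model stays frozen.
import torch
import torch.nn as nn

net = nn.Linear(256 + 32, 40)                    # frozen acoustic model (toy stand-in)
for p in net.parameters():
    p.requires_grad_(False)

code = torch.zeros(32, requires_grad=True)       # trainable per-speaker code
opt = torch.optim.SGD([code], lr=0.1)

# Dummy stand-in for a handful of adaptation utterances.
for feats, targets in [(torch.randn(7, 256), torch.randn(7, 40))]:
    inp = torch.cat([feats, code.expand(feats.size(0), -1)], dim=1)
    loss = nn.functional.mse_loss(net(inp), targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
```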
Voice Conversion from Unaligned Corpora Using Variational Autoencoding Wasserstein Generative Adversarial Networks
- Computer Science · INTERSPEECH
- 2017
This paper proposes a non-parallel VC framework with a variational autoencoding Wasserstein generative adversarial network (VAW-GAN) that explicitly considers a VC objective when building the speech model.