Corpus ID: 48363067

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

@inproceedings{Jia2018TransferLF,
  title={Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis},
  author={Ye Jia and Yu Zhang and Ron J. Weiss and Quan Wang and Jonathan Shen and Fei Ren and Zhifeng Chen and Patrick Nguyen and Ruoming Pang and Ignacio Lopez-Moreno and Yonghui Wu},
  booktitle={NeurIPS},
  year={2018}
}
We describe a neural network-based system for text-to-speech (TTS) synthesis that is able to generate speech audio in the voice of many different speakers, including those unseen during training. [...] Our system consists of three independently trained components: (1) a speaker encoder network, trained on a speaker verification task using an independent dataset of noisy speech from thousands of speakers without transcripts, to generate a fixed-dimensional embedding vector from seconds of reference…
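To make the conditioning concrete: in this system the fixed-dimensional speaker embedding produced by the encoder is concatenated with the synthesizer's text-encoder output at each time step. A minimal NumPy sketch of that step follows, with illustrative shapes and names (ours, not the paper's code):

```python
import numpy as np

def condition_on_speaker(encoder_outputs: np.ndarray,
                         speaker_embedding: np.ndarray) -> np.ndarray:
    """Concatenate a fixed-dimensional speaker embedding to every
    time step of the text-encoder output.

    encoder_outputs:   (T, D_text)  -- one row per input token
    speaker_embedding: (D_spk,)     -- unit-norm embedding vector
    returns:           (T, D_text + D_spk)
    """
    T = encoder_outputs.shape[0]
    tiled = np.tile(speaker_embedding, (T, 1))        # (T, D_spk)
    return np.concatenate([encoder_outputs, tiled], axis=1)

# Toy usage: 5 encoder steps of width 512, a 256-dim speaker embedding.
enc = np.random.randn(5, 512).astype(np.float32)
spk = np.random.randn(256).astype(np.float32)
spk /= np.linalg.norm(spk)                            # embeddings are unit-norm
print(condition_on_speaker(enc, spk).shape)           # (5, 768)
```

Citations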
Multi-speaker Sequence-to-sequence Speech Synthesis for Data Augmentation in Acoustic-to-word Speech Recognition
TLDR
This work extends the speech synthesizer so that it can output speech of many speakers, and demonstrates that the A2W model trained with the multi-speaker model achieved a significant improvement over both the baseline and the single-speaker model.
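A hedged sketch of how such TTS-based augmentation might be wired up, assuming a trained multi-speaker synthesizer (the `synthesize` stub and all names here are hypothetical, not the paper's code):

```python
import numpy as np

def synthesize(text: str, speaker_embedding: np.ndarray) -> np.ndarray:
    """Placeholder for a trained multi-speaker TTS model (hypothetical)."""
    return np.random.randn(16000)  # stands in for 1 s of synthetic audio

transcripts = ["hello world", "speech recognition"]
speaker_bank = [np.random.randn(256) for _ in range(4)]  # sampled embeddings

# Augment: pair every transcript with several synthetic voices, then
# mix the synthetic (audio, text) pairs into the real A2W training set.
augmented = [(synthesize(t, spk), t) for t in transcripts for spk in speaker_bank]
```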
Neural Text-to-Speech Adaptation from Low Quality Public Recordings
TLDR
This work introduces meta-learning to adapt the neural TTS front-end and shows that for low quality public recordings, the adaptation based on the multi-speaker corpus can generate a cleaner target voice in comparison with the speaker-dependent model.
A Unified Speaker Adaptation Method for Speech Synthesis using Transcribed and Untranscribed Speech with Backpropagation
TLDR
Surprisingly, adaptation with untranscribed speech surpassed the transcribed counterpart in the subjective test, which reveals the limitations of the conventional acoustic model and hints at potential directions for improvements.
Voice Cloning: a Multi-Speaker Text-to-Speech Synthesis Approach based on Transfer Learning
TLDR
The proposed approach aims to overcome these limitations by building a system that can model a multi-speaker acoustic space and generate speech audio similar to the voices of different target speakers, even those not observed during training.
Speaker Adaptation of Acoustic Model using a Few Utterances in DNN-based Speech Synthesis Systems
TLDR
This paper presents a novel technique to estimate a speaker-specific model using a partial copy of the speaker-independent model, by creating a separate parallel branch stemming from an intermediate hidden layer of the base network.
Comparative Study on Neural Vocoders for Multispeaker Text-To-Speech Synthesis
Recent research on multispeaker text-to-speech synthesis allows cloning a voice unseen during training without retraining the model on new speech samples. Multispeaker text-to-speech synthesis…
Cross-lingual Multi-speaker Text-to-speech Synthesis for Voice Cloning without Using Parallel Corpus for Unseen Speakers
TLDR
It is found that normalizing speaker embedding x-vectors by L2-norm normalization or whitening substantially improves output quality in many cases, and the WaveNet performance appears to be language-independent: the authors' WaveNet is trained on Cantonese speech and can be used to generate Mandarin and English speech well.
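For reference, L2-norm normalization and PCA whitening of a bank of x-vectors can be written in a few lines of NumPy; this is a generic sketch of the two operations named in the summary, not the authors' code:

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each embedding (row) to unit Euclidean length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def whiten(X: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """PCA-whiten a matrix of embeddings (rows): zero mean,
    approximately identity covariance. X has shape (N, D)."""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(X) - 1)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return Xc @ eigvecs / np.sqrt(eigvals + eps)

X = np.random.randn(100, 512)   # 100 x-vectors of dimension 512
X_white = whiten(X)
X_unit = l2_normalize(X)
```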
Training Multi-Speaker Neural Text-to-Speech Systems using Speaker-Imbalanced Speech Corpora
TLDR
Using an ensemble multi-speaker model, in which each subsystem is trained on a subset of available data, can further improve the quality of the synthetic speech especially for underrepresented speakers whose training data is limited.
Msdtron: a high-capability multi-speaker speech synthesis system for diverse data using characteristic information
TLDR
A high-capability speech synthesis system in which a representation of the harmonic structure of speech, called the excitation spectrogram, is designed to directly guide the learning of harmonics in the mel-spectrogram, and a conditional gated LSTM (CGLSTM) is proposed to control the flow of text-content information through the network by re-weighting the LSTM gates using speaker information.
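One plausible reading of the gate re-weighting idea, sketched in NumPy: a standard LSTM step whose input, forget, and output gates are scaled element-wise by sigmoid projections of a speaker embedding. All parameter names and shapes are illustrative assumptions, not the Msdtron implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cglstm_step(x, h, c, spk, W, U, b, G):
    """One LSTM step with speaker-conditioned gate re-weighting
    (assumed reading of CGLSTM, not the paper's exact formulation).
    W: (4H, D_in), U: (4H, H), b: (4H,), G["i"/"f"/"o"]: (H, D_spk)."""
    z = W @ x + U @ h + b                     # (4H,) pre-activations
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    i = i * sigmoid(G["i"] @ spk)             # speaker-dependent gate scaling
    f = f * sigmoid(G["f"] @ spk)
    o = o * sigmoid(G["o"] @ spk)
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

# Toy shapes: 8-dim input, 16-dim hidden state, 4-dim speaker embedding.
rng = np.random.default_rng(0)
D_in, H, D_spk = 8, 16, 4
W, U, b = rng.normal(size=(4*H, D_in)), rng.normal(size=(4*H, H)), np.zeros(4*H)
G = {k: rng.normal(size=(H, D_spk)) for k in "ifo"}
h, c = cglstm_step(rng.normal(size=D_in), np.zeros(H), np.zeros(H),
                   rng.normal(size=D_spk), W, U, b, G)
```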
Speaker verification-derived loss and data augmentation for DNN-based multispeaker speech synthesis
TLDR
Two directions are explored: forcing the network to learn a better speaker identity representation by appending an additional loss term, and augmenting the input data pertaining to each speaker using waveform manipulation methods that improve the intelligibility of the multispeaker TTS system.
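The first direction can be sketched as an auxiliary embedding-matching term added to the reconstruction loss; the cosine-distance choice and the weighting below are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity between two embedding vectors."""
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def total_loss(recon_loss: float,
               emb_synth: np.ndarray,
               emb_ref: np.ndarray,
               weight: float = 0.1) -> float:
    """Reconstruction loss plus an auxiliary term pulling the
    verification embedding of the synthesized audio toward the
    target speaker's reference embedding. `weight` is a
    hypothetical tuning knob, not a value from the paper."""
    return recon_loss + weight * cosine_distance(emb_synth, emb_ref)
```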

References

Showing 1-10 of 26 references
Sample Efficient Adaptive Text-to-Speech
TLDR
Three strategies for adapting the multi-speaker neural network to new speakers are introduced and benchmarked, obtaining state-of-the-art results in both sample naturalness and voice similarity with merely a few minutes of audio data from new speakers.
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
TLDR
It is shown that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.
Speaker Adaptation in DNN-Based Speech Synthesis Using d-Vectors
TLDR
The proposed method of unsupervised adaptation using the d-vector is compared with the commonly used i-vector based approach for speaker adaptation, and listening tests show that: (1) for speech quality, the DNN-based approach is significantly preferred over the i-vector based approach; and (2) for speaker similarity, the d-vector and i-vector approaches were found to perform similarly.
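For context, the standard d-vector recipe averages frame-level hidden activations of a speaker-verification DNN over an utterance and L2-normalizes the result; a minimal sketch, with illustrative shapes:

```python
import numpy as np

def extract_dvector(frame_activations: np.ndarray) -> np.ndarray:
    """Average the last-hidden-layer activations of a speaker-
    verification DNN over all frames of an utterance, then
    L2-normalize -- the standard d-vector recipe.

    frame_activations: (num_frames, hidden_dim)
    """
    d = frame_activations.mean(axis=0)
    return d / np.linalg.norm(d)

# Toy usage: 200 frames of 256-dim activations from the SV network.
acts = np.random.randn(200, 256)
print(extract_dvector(acts).shape)  # (256,)
```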
Tacotron: Towards End-to-End Speech Synthesis
TLDR
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Neural Voice Cloning with a Few Samples
TLDR
While speaker adaptation can achieve better naturalness and similarity, the cloning time or required memory for the speaker encoding approach is significantly less, making it favorable for low-resource deployment.
Fitting New Speakers Based on a Short Untranscribed Sample
TLDR
This work presents a method designed to capture a new speaker from a short untranscribed audio sample by employing an additional network that, given an audio sample, places the speaker in the embedding space.
Deep Voice 3: 2000-Speaker Neural Text-to-Speech
TLDR
Deep Voice 3 is presented, a fully-convolutional attention-based neural text-to-speech (TTS) system that matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster.
Deep neural networks for small footprint text-dependent speaker verification
TLDR
Experimental results show the DNN-based speaker verification system achieves good performance compared to a popular i-vector system on a small footprint text-dependent speaker verification task, is more robust to additive noise, and outperforms the i-vector system at low false rejection operating points.
VoiceLoop: Voice Fitting and Synthesis via a Phonological Loop
TLDR
A new neural text-to-speech method is presented that can transform text to speech in voices that are sampled in the wild, without requiring aligned phonemes or linguistic features, making TTS accessible to a wider range of applications.
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps…