Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?

@inproceedings{Cooper2020CanSA,
  title={Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?},
  author={Erica Cooper and Cheng-I Lai and Yusuke Yasuda and Junichi Yamagishi},
  booktitle={INTERSPEECH},
  year={2020}
}
Previous work on speaker adaptation for end-to-end speech synthesis still falls short in speaker similarity. We investigate an orthogonal approach to the current speaker adaptation paradigms, speaker augmentation, by creating artificial speakers and by taking advantage of low-quality data. The base Tacotron2 model is modified to account for the channel and dialect factors inherent in these corpora. In addition, we describe a warm-start training strategy that we adopted for Tacotron2 training. A… 
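The warm-start strategy mentioned in the abstract, initializing a new model from a pretrained checkpoint and keeping fresh initializations where the architectures disagree, can be sketched as follows. This is a minimal illustration with NumPy arrays standing in for model tensors; the function name `warm_start` and the toy parameter names are assumptions, not the paper's code:

```python
import numpy as np

def warm_start(target_params, pretrained_params):
    """Copy pretrained parameters into a new model's parameter dict.

    Only parameters with a matching name *and* shape are copied; anything
    else (e.g. a speaker-embedding table resized for more speakers) keeps
    its fresh initialization. Returns the lists of copied/skipped names.
    """
    copied, skipped = [], []
    for name, value in target_params.items():
        pre = pretrained_params.get(name)
        if pre is not None and pre.shape == value.shape:
            target_params[name] = pre.copy()
            copied.append(name)
        else:
            skipped.append(name)
    return copied, skipped

# Toy example: the encoder weight matches; the speaker table was resized
# for more speakers, so it keeps its fresh (zero) initialization.
pretrained = {"encoder.w": np.ones((4, 4)), "speaker_emb": np.ones((10, 8))}
target = {"encoder.w": np.zeros((4, 4)), "speaker_emb": np.zeros((40, 8))}
copied, skipped = warm_start(target, pretrained)
```

In frameworks such as PyTorch the same effect is commonly obtained by filtering a checkpoint's state dict by shape before loading it non-strictly.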

Citations

Speaker verification-derived loss and data augmentation for DNN-based multispeaker speech synthesis
TLDR
Two directions are explored: forcing the network to learn a better speaker identity representation by appending an additional loss term, and augmenting each speaker's input data with waveform-manipulation methods that improve the intelligibility of the multispeaker TTS system.
A Text-to-Speech Pipeline, Evaluation Methodology, and Initial Fine-Tuning Results for Child Speech Synthesis
TLDR
This study developed and validated a training pipeline for fine-tuning state-of-the-art (SOTA) neural TTS models using child speech datasets and showed a strong correlation between real and synthetic child voices.
Data-augmented cross-lingual synthesis in a teacher-student framework
TLDR
Results show that the proposed approach improves the retention of speaker characteristics in the speech, while managing to retain high levels of naturalness and prosodic variation.
SpeechSplit 2.0: Unsupervised Speech Disentanglement for Voice Conversion without Tuning Autoencoder Bottlenecks
TLDR
This paper proposes SpeechSplit 2.0, which constrains the information of the speech component to be disentangled on the autoencoder input using efficient signal processing methods instead of bottleneck tuning.
A Survey on Neural Speech Synthesis
TLDR
A comprehensive survey on neural TTS is conducted, aiming to provide a good understanding of current research and future trends, focusing on the key components in neural TTS, including text analysis, acoustic models, and vocoders.
Combining speakers of multiple languages to improve quality of neural voices
In this work, we explore multiple architectures and training procedures for developing a multi-speaker and multi-lingual neural TTS system with the goals of a) improving the quality when the…
Improved Prosodic Clustering for Multispeaker and Speaker-Independent Phoneme-Level Prosody Control
TLDR
This paper presents a method for phoneme-level prosody control of F0 and duration on a multispeaker text-to-speech setup, which is based on prosodic clustering, using an autoregressive attention-based model and employing data augmentation, F0 normalization, balanced clustering for duration, and speaker-independent prosodic clusters.

References

SHOWING 1-10 OF 41 REFERENCES
Multi-Speaker End-to-End Speech Synthesis
TLDR
It is demonstrated that the multi-speaker ClariNet outperforms state-of-the-art systems in terms of naturalness, because the whole model is jointly optimized in an end-to-end manner.
Neural Text-to-Speech Adaptation from Low Quality Public Recordings
TLDR
This work introduces meta-learning to adapt the neural TTS front-end and shows that for low quality public recordings, the adaptation based on the multi-speaker corpus can generate a cleaner target voice in comparison with the speaker-dependent model.
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
TLDR
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
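The "randomly sampled speaker embeddings" idea above can be illustrated by drawing a Gaussian vector and projecting it onto the unit hypersphere, since speaker-encoder outputs are typically L2-normalized. The embedding dimensionality here is an illustrative assumption, not the cited paper's exact configuration:

```python
import numpy as np

def sample_speaker_embedding(dim=256, rng=None):
    """Draw a random speaker embedding on the unit hypersphere.

    Normalizing a Gaussian draw yields a uniformly distributed direction,
    a plausible "novel speaker" point in the embedding space.
    """
    rng = rng or np.random.default_rng()
    e = rng.standard_normal(dim)
    return e / np.linalg.norm(e)

# The sampled vector can condition the synthesizer in place of a real
# speaker's embedding.
emb = sample_speaker_embedding(dim=256, rng=np.random.default_rng(0))
```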
Neural Voice Cloning with a Few Samples
TLDR
While speaker adaptation can achieve better naturalness and similarity, the cloning time or required memory for the speaker encoding approach is significantly less, making it favorable for low-resource deployment.
Sample Efficient Adaptive Text-to-Speech
TLDR
Three strategies for adapting the multi-speaker neural network to new speakers are introduced and benchmarked, obtaining state-of-the-art results in both sample naturalness and voice similarity with merely a few minutes of audio data from new speakers.
Tacotron: Towards End-to-End Speech Synthesis
TLDR
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System
TLDR
Experimental results on Voxceleb and NIST LRE 07 datasets show that the performance of end-to-end learning system could be significantly improved by the proposed encoding layer and loss function.
Mel-spectrogram augmentation for sequence to sequence voice conversion
TLDR
This study experimentally investigated the effects of Mel-spectrogram augmentation on training the sequence-to-sequence voice conversion (VC) model from scratch and suggested new policies (i.e., frequency warping, loudness and time length control) for more data variations.
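The time-length and loudness policies named in that summary can be sketched as simple operations on a mel-spectrogram matrix (mel bins × frames). Resampling by linear interpolation and a constant dB shift are illustrative assumptions here, not the cited paper's exact implementation:

```python
import numpy as np

def stretch_time(mel, rate):
    """Change the time length of a (n_mels, n_frames) spectrogram.

    rate > 1 speeds up (fewer frames); rate < 1 slows down. Each mel bin
    is resampled along the time axis with linear interpolation.
    """
    n_mels, n_frames = mel.shape
    new_frames = max(1, int(round(n_frames / rate)))
    old_idx = np.arange(n_frames)
    new_idx = np.linspace(0, n_frames - 1, new_frames)
    return np.stack([np.interp(new_idx, old_idx, mel[b]) for b in range(n_mels)])

def scale_loudness(mel_db, gain_db):
    """Shift a log-magnitude (dB) mel-spectrogram by a constant gain."""
    return mel_db + gain_db

mel = np.random.default_rng(0).standard_normal((80, 100))
faster = stretch_time(mel, rate=1.25)     # ~80 frames instead of 100
louder = scale_loudness(mel, gain_db=3.0)
```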
Towards Robust Neural Vocoding for Speech Generation: A Survey
TLDR
It is found that the speaker variety is much more important for achieving a universal vocoder than the language, and WaveNet and WaveRNN are more suitable for text-to-speech models, while Parallel WaveGAN is more suited for voice conversion applications.
Vocal Tract Length Perturbation (VTLP) improves speech recognition
TLDR
Improvements in speech recognition are obtained without increasing the number of training epochs, suggesting that data transformations should be an important component of training neural networks for speech, especially for data-limited projects.
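VTLP's core operation is a piecewise-linear warp of the frequency axis by a per-utterance factor α, which mimics a shorter or longer vocal tract and thus an "artificial speaker". A minimal sketch of the warping function in the Jaitly & Hinton formulation; the boundary frequency `f_hi` and Nyquist `f_max` values are illustrative assumptions:

```python
def vtlp_warp(f, alpha, f_hi=4800.0, f_max=8000.0):
    """Piecewise-linear VTLP frequency warp.

    Frequencies up to a boundary are scaled by alpha; above it, a second
    linear segment maps the remainder onto the band up to f_max, so the
    Nyquist frequency stays fixed and alpha = 1.0 is the identity.
    """
    boundary = f_hi * min(alpha, 1.0) / alpha
    if f <= boundary:
        return alpha * f
    # Map (boundary, f_max] linearly onto (alpha * boundary, f_max].
    return f_max - (f_max - alpha * boundary) * (f_max - f) / (f_max - boundary)
```

Applying the warp to the mel filterbank's center frequencies (rather than the audio itself) is a common way to realize VTLP during feature extraction.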