Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?
@inproceedings{Cooper2020CanSA,
  title={Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?},
  author={Erica Cooper and Cheng-I Lai and Yusuke Yasuda and Junichi Yamagishi},
  booktitle={INTERSPEECH},
  year={2020}
}
Previous work on speaker adaptation for end-to-end speech synthesis still falls short in speaker similarity. We investigate an orthogonal approach to the current speaker adaptation paradigms, speaker augmentation, by creating artificial speakers and by taking advantage of low-quality data. The base Tacotron2 model is modified to account for the channel and dialect factors inherent in these corpora. In addition, we describe a warm-start training strategy that we adopted for Tacotron2 training. A…
7 Citations
Speaker verification-derived loss and data augmentation for DNN-based multispeaker speech synthesis
- Computer Science · 2021 29th European Signal Processing Conference (EUSIPCO)
- 2021
Two directions are explored: forcing the network to learn a better speaker identity representation by appending an additional loss term, and augmenting the input data pertaining to each speaker using waveform manipulation methods that improve the intelligibility of the multi-speaker TTS system.
A Text-to-Speech Pipeline, Evaluation Methodology, and Initial Fine-Tuning Results for Child Speech Synthesis
- Computer Science · IEEE Access
- 2022
This study developed and validated a training pipeline for fine-tuning state-of-the-art (SOTA) neural TTS models using child speech datasets and showed a strong correlation between real and synthetic child voices.
Data-augmented cross-lingual synthesis in a teacher-student framework
- Computer Science · ArXiv
- 2022
Results show that the proposed approach improves the retention of speaker characteristics in the speech, while managing to retain high levels of naturalness and prosodic variation.
SpeechSplit 2.0: Unsupervised Speech Disentanglement for Voice Conversion Without Tuning Autoencoder Bottlenecks
- Computer Science · ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2022
This paper proposes SpeechSplit 2.0, which constrains the information of the speech component to be disentangled on the autoencoder input using efficient signal-processing methods instead of bottleneck tuning.
A Survey on Neural Speech Synthesis
- Computer Science · ArXiv
- 2021
A comprehensive survey on neural TTS is conducted, aiming to provide a good understanding of current research and future trends, with a focus on the key components of neural TTS, including text analysis, acoustic models, and vocoders.
Combining speakers of multiple languages to improve quality of neural voices
- Computer Science · 11th ISCA Speech Synthesis Workshop (SSW 11)
- 2021
In this work, we explore multiple architectures and training procedures for developing a multi-speaker and multi-lingual neural TTS system with the goals of a) improving the quality when the…
Improved Prosodic Clustering for Multispeaker and Speaker-Independent Phoneme-Level Prosody Control
- Linguistics · SPECOM
- 2021
This paper presents a method for phoneme-level prosody control of F0 and duration on a multispeaker text-to-speech setup, which is based on prosodic clustering, using an autoregressive attention-based model and employing data augmentation, F0 normalization, balanced clustering for duration, and speaker-independent prosodic clusters.
References
Multi-Speaker End-to-End Speech Synthesis
- Computer Science · ArXiv
- 2019
It is demonstrated that the multi-speaker ClariNet outperforms state-of-the-art systems in terms of naturalness, because the whole model is jointly optimized in an end-to-end manner.
Neural Text-to-Speech Adaptation from Low Quality Public Recordings
- Computer Science · 10th ISCA Workshop on Speech Synthesis (SSW 10)
- 2019
This work introduces meta-learning to adapt the neural TTS front-end and shows that for low quality public recordings, the adaptation based on the multi-speaker corpus can generate a cleaner target voice in comparison with the speaker-dependent model.
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
- Computer Science · NeurIPS
- 2018
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
Neural Voice Cloning with a Few Samples
- Computer Science · NeurIPS
- 2018
While speaker adaptation can achieve better naturalness and similarity, the cloning time or required memory for the speaker encoding approach is significantly less, making it favorable for low-resource deployment.
Sample Efficient Adaptive Text-to-Speech
- Computer Science · ICLR
- 2019
Three strategies for adapting the multi-speaker neural network to new speakers are introduced and benchmarked, obtaining state-of-the-art results in both sample naturalness and voice similarity with merely a few minutes of audio data from new speakers.
Tacotron: Towards End-to-End Speech Synthesis
- Computer Science · INTERSPEECH
- 2017
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System
- Computer Science · Odyssey
- 2018
Experimental results on Voxceleb and NIST LRE 07 datasets show that the performance of end-to-end learning system could be significantly improved by the proposed encoding layer and loss function.
Mel-spectrogram augmentation for sequence to sequence voice conversion
- Computer Science · ArXiv
- 2020
This study experimentally investigated the effects of mel-spectrogram augmentation on training a sequence-to-sequence voice conversion (VC) model from scratch and suggested new policies (i.e., frequency warping, loudness, and time-length control) for more data variation.
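The loudness and time-length policies named in this summary can be sketched directly on a log-mel spectrogram. The following is a minimal illustrative sketch, not the paper's implementation; the function name and parameter ranges are assumptions:

```python
import numpy as np

def augment_mel(mel, gain_db=0.0, time_scale=1.0):
    """Toy loudness and time-length augmentation of a log-mel spectrogram.

    mel        : (n_mels, n_frames) log-mel spectrogram in dB
    gain_db    : constant loudness offset, e.g. drawn from [-6, 6]
    time_scale : > 1 stretches the frame axis, < 1 compresses it
    """
    # Loudness control: on a log scale, a gain is a constant dB offset.
    out = np.asarray(mel, dtype=float) + gain_db
    # Time-length control: linearly resample the frame axis.
    n_frames = out.shape[1]
    new_len = max(1, int(round(n_frames * time_scale)))
    src = np.linspace(0.0, n_frames - 1, new_len)
    idx = np.arange(n_frames)
    return np.stack([np.interp(src, idx, row) for row in out])
```

Each random draw of `gain_db` and `time_scale` yields a new variant of the same utterance, which is the sense in which such policies add data variation.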
Towards Robust Neural Vocoding for Speech Generation: A Survey
- Computer Science · ArXiv
- 2019
It is found that the speaker variety is much more important for achieving a universal vocoder than the language, and WaveNet and WaveRNN are more suitable for text-to-speech models, while Parallel WaveGAN is more suited for voice conversion applications.
Vocal Tract Length Perturbation (VTLP) improves speech recognition
- Computer Science
- 2013
Improvements in speech recognition are obtained without increasing the number of training epochs, suggesting that data transformations should be an important component of training neural networks for speech, especially for data-limited projects.
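VTLP simulates different vocal tract lengths by randomly warping the frequency axis. The standard piecewise-linear warp can be sketched as follows; this is an illustrative helper (function name and defaults are assumptions), with the warp linear below a boundary frequency and interpolated above it so the Nyquist frequency maps to itself:

```python
import numpy as np

def vtlp_warp(freqs, alpha, f_hi=4800.0, sr=16000.0):
    """Piecewise-linear VTLP frequency warp (illustrative sketch).

    freqs : frequencies in Hz (array-like)
    alpha : warp factor, typically drawn uniformly from [0.9, 1.1]
    f_hi  : boundary frequency below which warping is purely linear
    sr    : sampling rate in Hz
    """
    freqs = np.asarray(freqs, dtype=float)
    nyq = sr / 2.0
    boundary = f_hi * min(alpha, 1.0) / alpha
    # Below the boundary: scale linearly by alpha.
    lo = freqs * alpha
    # Above it: interpolate so that nyq maps to nyq.
    hi = nyq - (nyq - f_hi * min(alpha, 1.0)) / (nyq - boundary) * (nyq - freqs)
    return np.where(freqs <= boundary, lo, hi)
```

Applying this warp to the filterbank center frequencies (or to the STFT bin frequencies) with a fresh random `alpha` per utterance or per speaker is one way such perturbation creates "artificial" speakers for augmentation.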