BOFFIN TTS: Few-Shot Speaker Adaptation by Bayesian Optimization

Henry B. Moss, Vatsal Aggarwal, Nishant Prateek, Javier I. González, Roberto Barra-Chicote
ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
We present BOFFIN TTS (Bayesian Optimization For FIne-tuning Neural Text To Speech), a novel approach for few-shot speaker adaptation. Here, the task is to fine-tune a pre-trained TTS model to mimic a new speaker using a small corpus of target utterances. We demonstrate that there does not exist a one-size-fits-all adaptation strategy, with convincing synthesis requiring a corpus-specific configuration of the hyper-parameters that control fine-tuning. By using Bayesian optimization to… 
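The abstract describes tuning the fine-tuning hyper-parameters per corpus with Bayesian optimization. A minimal sketch of such a loop is below, using a hand-rolled Gaussian-process surrogate and expected-improvement acquisition; the objective `finetune_val_loss`, the single tuned hyper-parameter (log10 learning rate), the search bounds, and all constants are illustrative assumptions, not the paper's actual setup.

```python
import numpy as np
from math import erf, sqrt

# Hypothetical stand-in for the expensive objective: validation loss of a
# TTS model fine-tuned on the target-speaker corpus, as a function of the
# log10 learning rate. The real objective needs a full fine-tuning run;
# this synthetic bowl (minimum at -3.5) only illustrates the search loop.
def finetune_val_loss(log_lr: float) -> float:
    return (log_lr + 3.5) ** 2

def rbf_kernel(a: np.ndarray, b: np.ndarray, lengthscale: float = 0.5) -> np.ndarray:
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-6):
    """Posterior mean/std of a zero-mean GP with an RBF kernel."""
    K = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_obs))
    Ks = rbf_kernel(x_query, x_obs)
    v = np.linalg.solve(L, Ks.T)
    mu = Ks @ alpha
    var = np.clip(1.0 - np.sum(v * v, axis=0), 1e-12, None)  # k(x, x) = 1 for RBF
    return mu, np.sqrt(var)

def expected_improvement(mu, sigma, best_y):
    """EI acquisition for minimization."""
    z = (best_y - mu) / sigma
    pdf = np.exp(-0.5 * z ** 2) / sqrt(2 * np.pi)
    cdf = np.array([0.5 * (1.0 + erf(zi / sqrt(2))) for zi in z])
    return (best_y - mu) * cdf + sigma * pdf

rng = np.random.default_rng(0)
candidates = np.linspace(-5.0, -2.0, 301)   # search space for log10(lr)
x_obs = rng.uniform(-5.0, -2.0, size=3)     # a few random initial runs
y_obs = np.array([finetune_val_loss(x) for x in x_obs])

for _ in range(15):
    y_mean = y_obs.mean()                   # center targets for the zero-mean GP
    mu, sigma = gp_posterior(x_obs, y_obs - y_mean, candidates)
    ei = expected_improvement(mu + y_mean, sigma, y_obs.min())
    x_next = candidates[int(np.argmax(ei))]  # most promising hyper-parameter next
    x_obs = np.append(x_obs, x_next)
    y_obs = np.append(y_obs, finetune_val_loss(x_next))

best = float(x_obs[np.argmin(y_obs)])
print(f"best log10 learning rate: {best:.2f}")
```

Each "evaluation" here would in practice be a complete fine-tuning run on the few-shot corpus, which is why a sample-efficient surrogate-based search is attractive over grid or random search.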


GC-TTS: Few-shot Speaker Adaptation with Geometric Constraints
The experimental results demonstrate that GC-TTS generates high-quality speech from only a few minutes of training data, outperforming standard techniques in terms of speaker similarity to the target speaker.
nnSpeech: Speaker-Guided Conditional Variational Autoencoder for Zero-shot Multi-speaker Text-to-Speech
A zero-shot multi-speaker TTS system named nnSpeech is proposed that can synthesize a new speaker's voice without fine-tuning, using only one adaptation utterance; it generates a variable Z that contains both speaker characteristics and content information.
AdaDurIAN: Few-shot Adaptation for Neural Text-to-Speech with DurIAN
AdaDurIAN is introduced by training an improved DurIAN-based average model and leveraging it for few-shot learning with a shared speaker-independent content encoder across different speakers; it can outperform the baseline end-to-end system by a large margin.
AdaSpeech 2: Adaptive Text to Speech with Untranscribed Data
  • Yuzi Yan, Xu Tan, Tie-Yan Liu
  • ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
This paper develops AdaSpeech 2, an adaptive TTS system that leverages only untranscribed speech data for adaptation; it introduces a mel-spectrogram encoder to a well-trained TTS model to conduct speech reconstruction, and constrains the output sequence of the mel-spectrogram encoder to be close to that of the original phoneme encoder.
Using IPA-Based Tacotron for Data Efficient Cross-Lingual Speaker Adaptation and Pronunciation Enhancement
It is shown that by applying minor changes to a Tacotron model, one can transfer an existing TTS model for a new speaker with the same or a different language using only 20 minutes of data.
AdaSpeech: Adaptive Text to Speech for Custom Voice
AdaSpeech is proposed, an adaptive TTS system for high-quality and efficient customization of new voices and achieves much better adaptation quality than baseline methods, with only about 5K specific parameters for each speaker, which demonstrates its effectiveness for custom voice.
Unet-TTS: Improving Unseen Speaker and Style Transfer in One-shot Voice Cloning
A novel one-shot voice cloning algorithm called Unet-TTS, with good generalization ability for unseen speakers and styles, is presented; it outperforms both speaker-embedding and unsupervised style modeling (GST) approaches on an unseen emotional corpus.
AdaVocoder: Adaptive Vocoder for Custom Voice
The empirical results show that a high-quality custom voice system can be built by combining an adaptive acoustic model with an adaptive vocoder, mainly using a cross-domain consistency loss to solve the overfitting problem encountered by GAN-based neural vocoders in transfer learning for few-shot scenarios.
Unsupervised Learning For Sequence-to-sequence Text-to-speech For Low-resource Languages
Both objective and subjective evaluations show that the proposed unsupervised pre-training mechanism can synthesize more intelligible and natural speech with the same amount of paired training data.
GANSpeech: Adversarial Training for High-Fidelity Multi-Speaker Speech Synthesis
This work proposes GANSpeech, a high-fidelity multi-speaker TTS model that applies adversarial training to a non-autoregressive multi-speaker TTS model, and proposes simple but efficient automatic scaling methods for the feature matching loss used in adversarial training.


Sample Efficient Adaptive Text-to-Speech
Three strategies for adapting the multi-speaker neural network to new speakers are introduced and benchmarked, obtaining state-of-the-art results in both sample naturalness and voice similarity with merely a few minutes of audio data from new speakers.
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
Effect of Data Reduction on Sequence-to-sequence Neural TTS
This paper shows that the lack of data from one speaker can be compensated with data from other speakers: the naturalness of Tacotron2-like models trained on a blend of 5k utterances from 7 speakers is better than or equivalent to that of speaker-dependent models trained on 15k utterances.
Neural Voice Cloning with a Few Samples
While speaker adaptation can achieve better naturalness and similarity, the cloning time or required memory for the speaker encoding approach is significantly less, making it favorable for low-resource deployment.
Tacotron: Towards End-to-End Speech Synthesis
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Deep Speaker: an End-to-End Neural Speaker Embedding System
Results are presented suggesting that adapting from a model trained on Mandarin can improve accuracy for English speaker recognition, and that Deep Speaker outperforms a DNN-based i-vector baseline.
Deep Voice 2: Multi-Speaker Neural Text-to-Speech
It is shown that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.
WaveNet: A Generative Model for Raw Audio
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
Fitting New Speakers Based on a Short Untranscribed Sample
This work presents a method that is designed to capture a new speaker from a short untranscribed audio sample by employing an additional network that given an audio sample, places the speaker in the embedding space.
Semi-supervised Training for Improving Data Efficiency in End-to-end Speech Synthesis
A semi-supervised training framework is proposed to improve the data efficiency of Tacotron and allow it to utilize textual and acoustic knowledge contained in large, publicly-available text and speech corpora.