FedSpeech: Federated Text-to-Speech with Continual Learning

Ziyue Jiang, Yi Ren, Ming Lei, Zhou Zhao

Federated learning enables collaborative training of machine learning models under strict privacy restrictions, and federated text-to-speech aims to synthesize natural speech for multiple users from a few audio training samples stored locally on their devices. However, federated text-to-speech faces several challenges: very few training samples are available from each speaker, all training samples are stored on each user's local device, and the global model is vulnerable to various attacks. In this…


Federated Learning for Keyword Spotting
An extensive empirical study of the federated averaging algorithm for the "Hey Snips" wake word, based on a crowdsourced dataset that mimics a federation of wake word users, shows that an adaptive averaging strategy inspired by Adam greatly reduces the number of communication rounds required to reach the target performance.
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
FastSpeech: Fast, Robust and Controllable Text to Speech
FastSpeech, a novel feed-forward network based on Transformer that generates mel-spectrograms in parallel for TTS, is proposed; it speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x.
Communication-Efficient Learning of Deep Networks from Decentralized Data
This work presents a practical method for the federated learning of deep networks based on iterative model averaging, and conducts an extensive empirical evaluation, considering five different model architectures and four datasets.
Federated Learning for Mobile Keyboard Prediction
The federation algorithm, which enables training on a higher-quality dataset for this use case, is shown to achieve better prediction recall and the feasibility and benefit of training language models on client devices without exporting sensitive user data to servers are demonstrated.
Generalized End-to-End Loss for Speaker Verification
A new loss function called generalized end-to-end (GE2E) loss is proposed, which makes the training of speaker verification models more efficient than the previous tuple-based end-to-end (TE2E) loss function.
Parallel WaveGAN: A Fast Waveform Generation Model Based on Generative Adversarial Networks with Multi-Resolution Spectrogram
The proposed Parallel WaveGAN has only 1.44 M parameters and can generate 24 kHz speech waveforms 28.68 times faster than real time in a single-GPU environment, which is comparable to the best distillation-based Parallel WaveNet system.
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps
Piggyback: Adapting a Single Network to Multiple Tasks by Learning to Mask Weights
This work learns binary masks that “piggyback” on an existing network, or are applied to unmodified weights of that network to provide good performance on a new task, and shows performance comparable to dedicated fine-tuned networks for a variety of classification tasks.
Compacting, Picking and Growing for Unforgetting Continual Learning
This paper introduces an incremental learning method that is scalable to the number of sequential tasks in a continual learning process and shows that the knowledge accumulated through learning previous tasks is helpful to build a better model for the new tasks compared to training the models independently with tasks.
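
The iterative model averaging summarized in the entry for "Communication-Efficient Learning of Deep Networks from Decentralized Data" can be illustrated as a weighted mean of client parameters. The following is a minimal sketch, not the implementation from any of the listed papers: the function name `federated_average`, the use of flat lists of floats as model parameters, and the weighting by local sample count are all assumptions made for illustration.

```python
from typing import List, Sequence


def federated_average(
    client_weights: Sequence[Sequence[float]],
    client_sizes: Sequence[int],
) -> List[float]:
    """Average client parameter vectors, weighted by local sample counts.

    Hypothetical helper: each client is assumed to send a flat parameter
    vector plus the number of local training samples it used.
    """
    if not client_weights or len(client_weights) != len(client_sizes):
        raise ValueError("need exactly one sample count per client")
    total = sum(client_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    dim = len(client_weights[0])
    if any(len(w) != dim for w in client_weights):
        raise ValueError("all clients must send vectors of the same length")

    averaged = [0.0] * dim
    for weights, size in zip(client_weights, client_sizes):
        share = size / total  # this client's fraction of all samples
        for i, value in enumerate(weights):
            averaged[i] += share * value
    return averaged


# Two clients: the second holds three times as much data as the first,
# so its parameters count three times as much in the average.
global_weights = federated_average([[1.0, 2.0], [3.0, 4.0]], [1, 3])
```

In a full round, a server would send the averaged vector back to the clients, each client would train locally, and the averaging step would repeat; only parameters, not audio or text, leave the device in this sketch.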