SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model

Authors: Edresson Casanova, Christopher Dane Shulby, Eren Gölge, Nicolas Michael Müller, Frederico Santos de Oliveira, Arnaldo Cândido Júnior, Anderson da Silva Soares, Sandra Maria Aluísio, Moacir Antonelli Ponti
In this paper, we propose SC-GlowTTS: an efficient zero-shot multi-speaker text-to-speech model that improves similarity for speakers unseen during training. We propose a speaker-conditional architecture with a flow-based decoder that works in a zero-shot scenario. As text encoders, we explore a dilated residual convolutional encoder, a gated convolutional encoder, and a transformer-based encoder. Additionally, we show that adjusting a GAN-based vocoder for the spectrograms…
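The core idea of a speaker-conditional flow-based decoder can be illustrated with a toy affine coupling step whose scale and shift depend on an external speaker embedding. This is a minimal sketch, not the paper's implementation: the weight shapes, the single coupling layer, and the random speaker embedding are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def coupling_forward(x, speaker_emb, w_s, w_t):
    """Split channels; transform one half with a scale/shift predicted
    from the other half concatenated with the speaker embedding."""
    xa, xb = np.split(x, 2)
    h = np.concatenate([xa, speaker_emb])
    log_s = np.tanh(h @ w_s)   # bounded log-scale for numerical stability
    t = h @ w_t                # speaker-dependent shift
    yb = xb * np.exp(log_s) + t
    return np.concatenate([xa, yb]), log_s.sum()

def coupling_inverse(y, speaker_emb, w_s, w_t):
    """Exact inverse of coupling_forward, given the same conditioning."""
    ya, yb = np.split(y, 2)
    h = np.concatenate([ya, speaker_emb])
    log_s = np.tanh(h @ w_s)
    t = h @ w_t
    xb = (yb - t) * np.exp(-log_s)
    return np.concatenate([ya, xb])

dim, emb_dim = 4, 3
w_s = rng.normal(size=(dim // 2 + emb_dim, dim // 2)) * 0.1
w_t = rng.normal(size=(dim // 2 + emb_dim, dim // 2)) * 0.1
x = rng.normal(size=dim)
e = rng.normal(size=emb_dim)   # stands in for an unseen speaker's embedding

y, logdet = coupling_forward(x, e, w_s, w_t)
x_rec = coupling_inverse(y, e, w_s, w_t)
assert np.allclose(x, x_rec)   # the flow is exactly invertible
```

Because the coupling transform is invertible for any conditioning vector, the same trained flow can synthesize (and invert) spectrograms for a speaker embedding never seen during training, which is what makes the zero-shot setting possible.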


A Survey on Neural Speech Synthesis
A comprehensive survey on neural TTS is conducted, aiming to provide a good understanding of current research and future trends; it focuses on the key components in neural TTS, including text analysis, acoustic models, and vocoders.
YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
The YourTTS model builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training, achieving state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset.
Glow-WaveGAN 2: High-quality Zero-shot Text-to-speech Synthesis and Any-to-any Voice Conversion
The zero-shot scenario for speech generation aims at synthesizing a novel unseen voice with only one utterance of the target speaker. Although the challenges of adapting new voices in zero-shot…
GlowVC: Mel-spectrogram space disentangling model for language-independent text-free voice conversion
In this paper, we propose GlowVC: a multilingual multi-speaker flow-based model for language-independent text-free voice conversion. We build on Glow-TTS, which provides an architecture that enables…
Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data
We propose Guided-TTS 2, a diffusion-based generative model for high-quality adaptive TTS using untranscribed data. Guided-TTS 2 combines a speaker-conditional diffusion model with a speaker-dependent…
GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis
GenerSpeech is proposed, a text-to-speech model towards high-fidelity zero-shot style transfer of OOD custom voice that surpasses the state-of-the-art models in terms of audio quality and style similarity.
MLLP-VRAIN UPV systems for the IWSLT 2022 Simultaneous Speech Translation and Speech-to-Speech Translation tasks
This work describes the participation of the MLLP-VRAIN research group in the two shared tasks of the IWSLT 2022 conference: Simultaneous Speech Translation and Speech-to-Speech Translation. We…
The USTC-NELSLIP Offline Speech Translation Systems for IWSLT 2022
This paper describes USTC-NELSLIP's submissions to the IWSLT 2022 Offline Speech Translation task, including speech translation of talks from English to German, English to Chinese, and English to…
Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization
This work introduces a content driven audio-visual deepfake dataset, termed as Localized Audio Visual DeepFake (LAV-DF), explicitly designed for the task of learning temporal forgery localization and demonstrates the strong performance of the proposed method for both tasks of temporal forgeries localization and deepfake detection.
Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus
This paper proposes a transfer learning framework for TTS that utilizes a large amount of unlabeled speech dataset for pre-training, and extends the proposed method to zero-shot multi-speaker TTS (ZS-TTS).


Zero-Shot Multi-Speaker Text-To-Speech with State-Of-The-Art Neural Speaker Embeddings
Learnable dictionary encoding-based speaker embeddings with angular softmax loss can improve equal error rates over x-vectors in a speaker verification task and improve speaker similarity and naturalness for unseen speakers when used for zero-shot adaptation to new speakers in end-to-end speech synthesis.
Non-Autoregressive Neural Text-to-Speech
ParaNet, a non-autoregressive seq2seq model that converts text to spectrograms, is proposed; it is fully convolutional and brings a 46.7-times speed-up over the lightweight Deep Voice 3 at synthesis while obtaining reasonably good speech quality.
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding
Attentron is proposed, a few-shot TTS model that clones voices of speakers unseen during training that significantly outperforms state-of-the-art models when generating speech for unseen speakers in terms of speaker similarity and quality.
Flowtron: an Autoregressive Flow-based Generative Network for Text-to-Speech Synthesis
The mean opinion scores (MOS) show that Flowtron matches state-of-the-art TTS models in terms of speech quality, and results on control of speech variation, interpolation between samples and style transfer between speakers seen and unseen during training are provided.
FastSpeech: Fast, Robust and Controllable Text to Speech
A novel feed-forward network based on Transformer to generate mel-spectrograms in parallel for TTS is proposed, which speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x; it is called FastSpeech.
SpeedySpeech: Efficient Neural Speech Synthesis
It is shown that self-attention layers are not necessary for generation of high quality audio and a student-teacher network capable of high-quality faster-than-real-time spectrogram synthesis is proposed, with low requirements on computational resources and fast training time.
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
It is demonstrated that modeling periodic patterns of audio is crucial for enhancing sample quality, and the generality of HiFi-GAN is shown on mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis.
Deep Voice 3: 2000-Speaker Neural Text-to-Speech
Deep Voice 3 is presented, a fully-convolutional attention-based neural text-to-speech (TTS) system that matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster.
Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions
This work proposes a variant of WaveRNN, referred to as Speaker Conditional WaveRNN (SC-WaveRNN), aimed at the development of an efficient universal vocoder even for unseen speakers and recording conditions, and implements multi-speaker text-to-speech (TTS) synthesis similar to zero-shot speaker adaptation.