SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model

@inproceedings{Casanova2021SCGlowTTSAE,
  title={SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model},
  author={Edresson Casanova and Christopher Dane Shulby and Eren G{\"o}lge and Nicolas Michael M{\"u}ller and Frederico Santos de Oliveira and Arnaldo C{\^a}ndido J{\'u}nior and Anderson da Silva Soares and Sandra Maria Alu{\'i}sio and Moacir Antonelli Ponti},
  booktitle={Interspeech},
  year={2021}
}
In this paper, we propose SC-GlowTTS: an efficient zero-shot multi-speaker text-to-speech model that improves similarity for speakers unseen during training. We propose a speaker-conditional architecture that explores a flow-based decoder that works in a zero-shot scenario. As text encoders, we explore a dilated residual convolutional-based encoder, gated convolutional-based encoder, and transformer-based encoder. Additionally, we have shown that adjusting a GAN-based vocoder for the spectrograms… 
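
The abstract describes conditioning a flow-based decoder on speaker information so that speakers unseen during training can be synthesized. As a rough illustration of that general idea (not the authors' implementation), the Python/PyTorch sketch below shows one affine coupling step whose scale/shift network is conditioned on an external speaker embedding; the module name SpeakerConditionalCoupling, all layer sizes, and the choice of conditioning by channel-wise concatenation are illustrative assumptions rather than details taken from the paper.

import torch
import torch.nn as nn

class SpeakerConditionalCoupling(nn.Module):
    """One affine coupling step whose scale/shift network also receives a
    speaker embedding, so speakers unseen during training can be synthesized."""

    def __init__(self, mel_channels=80, hidden=192, spk_dim=256):
        super().__init__()
        half = mel_channels // 2
        # the conditioning network sees one half of the channels plus the
        # speaker embedding broadcast over time
        self.net = nn.Sequential(
            nn.Conv1d(half + spk_dim, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, mel_channels, kernel_size=3, padding=1),  # outputs log-scale and shift
        )

    def forward(self, z, spk_emb):
        # z: (batch, mel_channels, frames), spk_emb: (batch, spk_dim)
        z_a, z_b = z.chunk(2, dim=1)
        spk = spk_emb.unsqueeze(-1).expand(-1, -1, z.size(-1))
        log_s, t = self.net(torch.cat([z_a, spk], dim=1)).chunk(2, dim=1)
        z_b = z_b * torch.exp(log_s) + t      # affine transform of the second half
        logdet = log_s.sum(dim=(1, 2))        # flow log-determinant contribution
        return torch.cat([z_a, z_b], dim=1), logdet

# usage: the speaker embedding would come from an external speaker encoder
# (e.g., an x-vector-style network), which is what enables zero-shot speakers
layer = SpeakerConditionalCoupling()
z = torch.randn(2, 80, 120)       # mel-shaped latent: 2 utterances, 120 frames
spk_emb = torch.randn(2, 256)     # embedding of a speaker not seen in training
y, logdet = layer(z, spk_emb)
print(y.shape, logdet.shape)      # torch.Size([2, 80, 120]) torch.Size([2])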

Citations

YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
TLDR
The YourTTS model builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training, achieving state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset.
Do You Really Mean That? Content Driven Audio-Visual Deepfake Dataset and Multimodal Method for Temporal Forgery Localization
TLDR
This work introduces a content-driven audio-visual deepfake dataset, termed Localized Audio Visual DeepFake (LAV-DF), explicitly designed for the task of learning temporal forgery localization, and demonstrates the strong performance of the proposed method on both temporal forgery localization and deepfake detection.
GenerSpeech: Towards Style Transfer for Generalizable Out-Of-Domain Text-to-Speech Synthesis
Style transfer for out-of-domain (OOD) speech synthesis aims to generate speech samples with unseen style (e.g., speaker identity, emotion, and prosody) derived from an acoustic reference, while…
MLLP-VRAIN UPV systems for the IWSLT 2022 Simultaneous Speech Translation and Speech-to-Speech Translation tasks
This work describes the participation of the MLLP-VRAIN research group in the two shared tasks of the IWSLT 2022 conference: Simultaneous Speech Translation and Speech-to-Speech Translation. …
Text-free non-parallel many-to-many voice conversion using normalising flows
TLDR
Flow-based VC evaluations show no degradation between text-free and text-conditioned VC, resulting in improvements over the state-of-the-art, and joint training of the prior is found to negatively impact text-free VC quality.
The USTC-NELSLIP Offline Speech Translation Systems for IWSLT 2022
This paper describes USTC-NELSLIP’s submissions to the IWSLT 2022 Offline Speech Translation task, including speech translation of talks from English to German, English to Chinese and English to…
Transfer Learning Framework for Low-Resource Text-to-Speech using a Large-Scale Unlabeled Speech Corpus
TLDR
This paper proposes a transfer learning framework for TTS that utilizes a large-scale unlabeled speech dataset for pre-training, and extends the proposed method to zero-shot multi-speaker TTS (ZS-TTS).
Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention
TLDR
This work proposes a variant of an attention-based text-to-speech system that can reproduce a target voice from a few seconds of reference speech and generalize to very long utterances, while preserving a high degree of naturalness and similarity for short texts.
A Survey on Neural Speech Synthesis
TLDR
A comprehensive survey on neural TTS is conducted, aiming to provide a good understanding of current research and future trends, focusing on the key components of neural TTS, including text analysis, acoustic models, and vocoders.
AnyoneNet: Synchronized Speech and Talking Head Generation for Arbitrary Person
TLDR
An automatic method to generate synchronized speech and talking-head videos from text and a single face image of an arbitrary person, which outperforms the state-of-the-art landmark-based method at generating natural talking-head videos.

References

Showing 1-10 of 37 references
Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
TLDR
It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
FastSpeech: Fast, Robust and Controllable Text to Speech
TLDR
A novel feed-forward network based on Transformer to generate mel-spectrograms in parallel for TTS is proposed, which speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x, and is called FastSpeech.
SpeedySpeech: Efficient Neural Speech Synthesis
TLDR
It is shown that self-attention layers are not necessary for generation of high quality audio and a student-teacher network capable of high-quality faster-than-real-time spectrogram synthesis is proposed, with low requirements on computational resources and fast training time.
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
TLDR
It is demonstrated that modeling periodic patterns of audio is crucial for enhancing sample quality, and the generality of HiFi-GAN is shown on mel-spectrogram inversion of unseen speakers and end-to-end speech synthesis.
Deep Voice 3: 2000-Speaker Neural Text-to-Speech
TLDR
Deep Voice 3 is presented, a fully-convolutional attention-based neural text-to-speech (TTS) system that matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster.
Speaker Conditional WaveRNN: Towards Universal Neural Vocoder for Unseen Speaker and Recording Conditions
TLDR
This work proposes a variant of WaveRNN, referred to as Speaker Conditional WaveRNN (SC-WaveRNN), aimed at developing an efficient universal vocoder even for unseen speakers and recording conditions, and implements multi-speaker text-to-speech (TTS) synthesis similar to zero-shot speaker adaptation.
VoxCeleb2: Deep Speaker Recognition
TLDR
A very large-scale audio-visual speaker recognition dataset collected from open-source media is introduced and Convolutional Neural Network models and training strategies that can effectively recognise identities from voice under various conditions are developed and compared.
Efficiently Trainable Text-to-Speech System Based on Deep Convolutional Networks with Guided Attention
TLDR
This paper describes a novel text-to-speech (TTS) technique based on deep convolutional neural networks (CNN), without use of any recurrent units, to alleviate the economic costs of training.
X-Vectors: Robust DNN Embeddings for Speaker Recognition
TLDR
This paper uses data augmentation, consisting of added noise and reverberation, as an inexpensive method to multiply the amount of training data and improve robustness of deep neural network embeddings for speaker recognition.
Exploring the Encoding Layer and Loss Function in End-to-End Speaker and Language Recognition System
TLDR
Experimental results on the VoxCeleb and NIST LRE 07 datasets show that the performance of the end-to-end learning system can be significantly improved by the proposed encoding layer and loss function.