Corpus ID: 244908340

YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone

@article{Casanova2022YourTTSTZ,
  title={YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone},
  author={Edresson Casanova and Julian Weber and Christopher Dane Shulby and Arnaldo C{\^a}ndido J{\'u}nior and Eren G{\"o}lge and Moacir Antonelli Ponti},
  journal={ArXiv},
  year={2022},
  volume={abs/2112.02418}
}
YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening…
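As a quick orientation, the released YourTTS checkpoint can be run through the open-source Coqui TTS package. A minimal zero-shot cloning sketch follows; the model name matches the Coqui release, while the file paths are placeholders and exact API details depend on the installed version:

```python
from TTS.api import TTS

# Load the multilingual YourTTS checkpoint distributed with Coqui TTS.
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")

# Zero-shot multi-speaker synthesis: clone the voice in the reference clip
# without any fine-tuning. "target_speaker.wav" is a placeholder path.
tts.tts_to_file(
    text="This is zero-shot multi-speaker synthesis.",
    speaker_wav="target_speaker.wav",
    language="en",
    file_path="output.wav",
)
```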

Citations

Low-Resource Multilingual and Zero-Shot Multispeaker TTS

Using the language agnostic meta learning (LAML) procedure and modifications to a TTS encoder, it is shown that a system can learn to speak a new language using just 5 minutes of training data while retaining the ability to infer the voice of even unseen speakers in the newly learned language.

AdaSpeech 4: Adaptive Text to Speech in Zero-Shot Scenarios

AdaSpeech 4, a zero-shot adaptive TTS system for high-quality speech synthesis, is developed; it achieves better voice quality and similarity than baselines on multiple datasets without any fine-tuning.

Zero-Shot Long-Form Voice Cloning with Dynamic Convolution Attention

This work proposes a variant of an attention-based text-to-speech system that can reproduce a target voice from a few seconds of reference speech and generalize to very long utterances, while preserving a high degree of naturalness and similarity for short texts.

Adapting TTS models For New Speakers using Transfer Learning

It is found that fine-tuning a single-speaker TTS model on just 30 minutes of data can yield performance comparable to a model trained from scratch on more than 27 hours of data, for both male and female target speakers.

Residual Adapters for Few-Shot Text-to-Speech Speaker Adaptation

Experimental results show that the proposed approach achieves competitive naturalness and speaker similarity compared to full fine-tuning approaches, while requiring only ∼0.1% of the backbone model parameters for each speaker.
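For illustration, a residual adapter of this kind can be a two-layer bottleneck added onto a frozen backbone layer. The following is a minimal sketch assuming a PyTorch backbone; the dimensions and initialization are illustrative, not taken from the paper:

```python
import torch
import torch.nn as nn

class ResidualAdapter(nn.Module):
    """Bottleneck adapter inserted after a frozen backbone layer.

    Only the adapter is trained per speaker, so the per-speaker
    footprint is a tiny fraction of the backbone's parameters.
    """
    def __init__(self, d_model: int = 512, bottleneck: int = 16):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)
        self.up = nn.Linear(bottleneck, d_model)
        self.act = nn.ReLU()
        # Zero-init the up-projection so the adapter starts as an identity.
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.up(self.act(self.down(x)))
```

With d_model = 512 and a bottleneck of 16, each adapter adds roughly 17k parameters, which is how the per-speaker cost can land near 0.1% of a backbone with tens of millions of parameters.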

SANE-TTS: Stable And Natural End-to-End Multilingual Text-to-Speech

This paper introduces a speaker regularization loss that improves speech naturalness during cross-lingual synthesis, alongside the domain adversarial training applied in other multilingual TTS models.
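The domain adversarial training mentioned here is commonly implemented with a gradient reversal layer: the forward pass is the identity, while the backward pass negates and scales the gradient flowing into the encoder from a language (domain) classifier. A minimal PyTorch sketch of that generic mechanism, not SANE-TTS's actual code:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity on the forward pass; negated, scaled gradient backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x: torch.Tensor, lambd: float = 1.0) -> torch.Tensor:
    # A domain classifier attached through grad_reverse(hidden_states)
    # pushes the encoder toward language-independent representations.
    return GradReverse.apply(x, lambd)
```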

Cross-lingual Text-To-Speech with Flow-based Voice Conversion for Improved Pronunciation

This paper presents a method for end-to-end cross-lingual text-to-speech (TTS) which aims to preserve the target language’s pronunciation regardless of the original speaker’s language…

Towards Building Text-To-Speech Systems for the Next Billion Users

This paper evaluates the choice of acoustic models, vocoders, supplementary loss functions, training schedules, and speaker and language diversity for Dravidian and Indo-Aryan languages, and identifies monolingual FastPitch and HiFi-GAN V1 models, trained jointly on male and female speakers, as performing best.

Self supervised learning for robust voice cloning

This work utilizes features learned in a self-supervised framework via the Bootstrap Your Own Latent (BYOL) method, which is shown to produce high-quality speech representations when specific audio augmentations are applied to the vanilla algorithm.
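Two ingredients of BYOL referenced above, the slowly moving EMA target network and the negative-cosine prediction loss, can be sketched as follows, assuming PyTorch; the encoder architecture and the audio augmentations are omitted:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ema_update(target, online, tau: float = 0.996):
    # The target network is an exponential moving average of the online one.
    for t_param, o_param in zip(target.parameters(), online.parameters()):
        t_param.mul_(tau).add_(o_param, alpha=1.0 - tau)

def byol_loss(online_pred: torch.Tensor, target_proj: torch.Tensor) -> torch.Tensor:
    # Negative cosine similarity between L2-normalised vectors; the target
    # branch is detached so gradients flow only through the online network.
    p = F.normalize(online_pred, dim=-1)
    z = F.normalize(target_proj.detach(), dim=-1)
    return 2.0 - 2.0 * (p * z).sum(dim=-1).mean()
```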

Guided-TTS 2: A Diffusion Model for High-quality Adaptive Text-to-Speech with Untranscribed Data

We propose Guided-TTS 2, a diffusion-based generative model for high-quality adaptive TTS using untranscribed data. Guided-TTS 2 combines a speaker-conditional diffusion model with a speaker-dependent…

References

Showing 1–10 of 48 references

SC-GlowTTS: an Efficient Zero-Shot Multi-Speaker Text-To-Speech Model

The proposed SC-GlowTTS model is an efficient zero-shot multi-speaker text-to-speech model that improves similarity for speakers unseen during training; it is also shown that fine-tuning a GAN-based vocoder on the spectrograms predicted by the TTS model for the training dataset can improve the similarity and speech quality for new speakers.

Zero-Shot Multi-Speaker Text-To-Speech with State-Of-The-Art Neural Speaker Embeddings

Learnable dictionary encoding-based speaker embeddings with an angular softmax loss can improve equal error rates over x-vectors in a speaker verification task, and improve speaker similarity and naturalness for unseen speakers when used for zero-shot adaptation to new speakers in end-to-end speech synthesis.
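As a sketch of the margin-based objective family used here, the additive-angular-margin (ArcFace-style) variant below stands in for the paper's angular softmax; the scale and margin values are illustrative:

```python
import torch
import torch.nn.functional as F

def angular_margin_logits(emb, weight, labels, s: float = 30.0, m: float = 0.2):
    """Cosine logits with a margin added to the target-class angle.

    Usage: F.cross_entropy(angular_margin_logits(emb, W, y), y)
    """
    emb = F.normalize(emb, dim=-1)
    w = F.normalize(weight, dim=-1)
    cos = emb @ w.t()
    # Add the angular margin only on the ground-truth class, then rescale.
    theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
    target = torch.cos(theta + m)
    one_hot = F.one_hot(labels, cos.size(-1)).bool()
    return s * torch.where(one_hot, target, cos)
```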

One Model, Many Languages: Meta-learning for Multilingual Text-to-Speech

We introduce an approach to multilingual speech synthesis which uses the meta-learning concept of contextual parameter generation and produces natural-sounding multilingual speech using more…

Normalization Driven Zero-Shot Multi-Speaker Speech Synthesis

A novel zero-shot multi-speaker speech synthesis approach (ZSM-SS) is presented that leverages a normalization architecture and a speaker encoder within a non-autoregressive, multi-head-attention-driven encoder-decoder architecture.
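One common way to realise normalization-driven speaker conditioning is a layer norm whose gain and bias are predicted from the speaker embedding. The following is a hypothetical PyTorch sketch, with dimensions that are illustrative rather than taken from the paper:

```python
import torch
import torch.nn as nn

class SpeakerConditionalLayerNorm(nn.Module):
    """LayerNorm whose affine parameters come from a speaker embedding."""
    def __init__(self, d_model: int = 256, d_speaker: int = 192):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.to_gain = nn.Linear(d_speaker, d_model)
        self.to_bias = nn.Linear(d_speaker, d_model)

    def forward(self, x: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model); spk: (batch, d_speaker)
        g = self.to_gain(spk).unsqueeze(1)
        b = self.to_bias(spk).unsqueeze(1)
        return g * self.norm(x) + b
```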

Attentron: Few-Shot Text-to-Speech Utilizing Attention-Based Variable-Length Embedding

Attentron, a few-shot TTS model that clones the voices of speakers unseen during training, is proposed; it significantly outperforms state-of-the-art models in terms of speaker similarity and quality when generating speech for unseen speakers.

Light-TTS: Lightweight Multi-Speaker Multi-Lingual Text-to-Speech

This paper proposes a new lightweight multi-speaker, multi-lingual speech synthesis system, named LightTTS, which can quickly synthesize Chinese, English, or code-switched speech for multiple speakers in a non-autoregressive generation manner using only one model.

NoiseVC: Towards High Quality Zero-Shot Voice Conversion

NoiseVC, an approach that can disentangle content based on vector quantization (VQ) and Contrastive Predictive Coding (CPC), is proposed; it has a strong disentanglement ability with a small sacrifice in quality.
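The VQ half of such a bottleneck is typically a nearest-neighbour codebook lookup trained with a straight-through gradient. A minimal VQ-VAE-style sketch (a generic construction, not NoiseVC's actual code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Nearest-neighbour codebook lookup with a straight-through estimator."""
    def __init__(self, num_codes: int = 512, dim: int = 64, beta: float = 0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.beta = beta

    def forward(self, z: torch.Tensor):
        # z: (batch, time, dim) continuous content features.
        flat = z.reshape(-1, z.size(-1))
        d = torch.cdist(flat, self.codebook.weight)   # distance to every code
        idx = d.argmin(dim=-1).view(z.shape[:-1])     # nearest code per frame
        q = self.codebook(idx)
        # VQ-VAE codebook and commitment losses.
        loss = F.mse_loss(q, z.detach()) + self.beta * F.mse_loss(z, q.detach())
        # Straight-through: gradients bypass the discrete lookup.
        q = z + (q - z).detach()
        return q, idx, loss
```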

End-to-end Code-switched TTS with Mix of Monolingual Recordings

Yuewen Cao, Xixin Wu, H. Meng · ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) · 2019
The proposed E2E TTS systems can generate controllable foreign-accented speech at the character level using only a mixture of monolingual training data, and are confirmed to be effective in terms of quality and speaker similarity of the generated speech.

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.
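The conditioning mechanism this line of work popularised, broadcasting a fixed d-vector from a verification-trained speaker encoder across the text-encoder states, can be sketched as follows; module names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class SpeakerConditionedEncoder(nn.Module):
    """Concatenate a per-utterance d-vector onto every text-encoder frame."""
    def __init__(self, d_text: int = 512, d_spk: int = 256):
        super().__init__()
        self.proj = nn.Linear(d_text + d_spk, d_text)

    def forward(self, text_states: torch.Tensor, d_vector: torch.Tensor):
        # text_states: (batch, time, d_text); d_vector: (batch, d_spk)
        spk = d_vector.unsqueeze(1).expand(-1, text_states.size(1), -1)
        return self.proj(torch.cat([text_states, spk], dim=-1))
```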

Zero-Shot Voice Style Transfer with Only Autoencoder Loss

A new style transfer scheme that involves only an autoencoder with a carefully designed bottleneck is proposed, which achieves state-of-the-art results in many-to-many voice conversion with non-parallel data and is the first to perform zero-shot voice conversion.
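The core trick is an information bottleneck: downsample and narrow the content code enough that speaker identity cannot pass through, then decode with a swappable speaker embedding, training only on reconstruction. A toy PyTorch sketch under those assumptions (all sizes illustrative):

```python
import torch
import torch.nn as nn

class BottleneckVC(nn.Module):
    """Autoencoder-only voice conversion via a narrow, downsampled bottleneck."""
    def __init__(self, n_mels: int = 80, d_code: int = 32,
                 d_spk: int = 256, stride: int = 8):
        super().__init__()
        self.encode = nn.Conv1d(n_mels, d_code, kernel_size=stride, stride=stride)
        self.decode = nn.ConvTranspose1d(d_code + d_spk, n_mels,
                                         kernel_size=stride, stride=stride)

    def forward(self, mel: torch.Tensor, spk: torch.Tensor) -> torch.Tensor:
        # mel: (batch, n_mels, time); spk: (batch, d_spk)
        code = self.encode(mel)                                # content only
        spk = spk.unsqueeze(-1).expand(-1, -1, code.size(-1))  # broadcast
        return self.decode(torch.cat([code, spk], dim=1))
```

Training minimises reconstruction with the source speaker's embedding; at inference, passing a different speaker's embedding performs the zero-shot conversion.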