Speaker verification-derived loss and data augmentation for DNN-based multispeaker speech synthesis

@inproceedings{lorincz2021speaker,
  title={Speaker verification-derived loss and data augmentation for DNN-based multispeaker speech synthesis},
  author={Be{\'a}ta L{\H{o}}rincz and Adriana Stan and Mircea Giurgiu},
  booktitle={2021 29th European Signal Processing Conference (EUSIPCO)},
  year={2021}
}
Building multispeaker neural network-based text-to-speech synthesis systems commonly relies on the availability of large amounts of high-quality recordings from each speaker and on conditioning the training process on the speaker's identity or on a learned representation of it. However, when little data is available per speaker, or the number of speakers is limited, a multispeaker TTS system can be hard to train and yields poor speaker similarity and naturalness. In order to address this…


Contributions to neural speech synthesis using limited data enhanced with lexical features

Building single or multi-speaker neural network-based text-to-speech synthesis systems commonly relies on the availability of large amounts of high-quality recordings from each speaker and…

Voice Filter: Few-Shot Text-to-Speech Speaker Adaptation Using Voice Conversion as a Post-Processing Module

Results show that the Voice Filter outperforms state-of-the-art few-shot speech synthesis techniques in terms of objective and subjective metrics on one minute of speech on a diverse set of voices, while being competitive against a TTS model built on 30 times more data.

A review on state-of-the-art Automatic Speaker verification system from spoofing and anti-spoofing perspective

Background/Objectives: Anti-spoofing countermeasures are proliferating, with the aim of protecting Automatic Speaker Verification systems from spoofing attacks. This review is an amalgam of the…

From Speaker Verification to Multispeaker Speech Synthesis, Deep Transfer with Feedback Constraint

A system involving a feedback constraint for multispeaker speech synthesis is presented, which enhances knowledge transfer from speaker verification to speech synthesis by engaging the speaker verification network during TTS training.

Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

It is shown that randomly sampled speaker embeddings can be used to synthesize speech in the voice of novel speakers dissimilar from those used in training, indicating that the model has learned a high quality speaker representation.

Can Speaker Augmentation Improve Multi-Speaker End-to-End TTS?

This work investigates an orthogonal approach to the current speaker adaptation paradigms, speaker augmentation, by creating artificial speakers and by taking advantage of low-quality data, to account for the channel and dialect factors inherent in corpora.

Low-Resource Expressive Text-To-Speech Using Data Augmentation

This work presents a novel 3-step methodology to circumvent the costly operation of recording large amounts of target data in order to build expressive style voices with as little as 15 minutes of such recordings.

Deep Voice 2: Multi-Speaker Neural Text-to-Speech

It is shown that a single neural TTS system can learn hundreds of unique voices from less than half an hour of data per speaker, while achieving high audio quality synthesis and preserving the speaker identities almost perfectly.

Zero-Shot Multi-Speaker Text-To-Speech with State-Of-The-Art Neural Speaker Embeddings

Learnable dictionary encoding-based speaker embeddings with angular softmax loss can improve equal error rates over x-vectors in a speaker verification task and improve speaker similarity and naturalness for unseen speakers when used for zero-shot adaptation to new speakers in end-to-end speech synthesis.

Cross-lingual Multi-speaker Text-to-speech Synthesis for Voice Cloning without Using Parallel Corpus for Unseen Speakers

It is found that normalizing speaker embedding x-vectors by L2-norm normalization or whitening substantially improves output quality in many cases; the WaveNet vocoder also appears to be language-independent: trained on Cantonese speech, it generates Mandarin and English speech well.
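As a minimal sketch of the two normalization steps mentioned above, assuming x-vectors are plain NumPy arrays (the helper names `l2_normalize` and `whiten` are hypothetical, not from the cited paper's code):

```python
import numpy as np

def l2_normalize(xvector: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Scale a speaker-embedding x-vector to unit Euclidean length."""
    return xvector / max(np.linalg.norm(xvector), eps)

def whiten(xvectors: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """ZCA-whiten a batch of x-vectors (one per row) so that feature
    dimensions are decorrelated with roughly unit variance."""
    centered = xvectors - xvectors.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / len(xvectors)
    eigvals, eigvecs = np.linalg.eigh(cov)
    inv_sqrt = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return centered @ inv_sqrt

v = np.array([3.0, 4.0])
print(np.linalg.norm(l2_normalize(v)))  # ≈ 1.0
```

Either transform constrains the embedding distribution the synthesis model conditions on, which is the property the paper credits for the quality gains.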

Learning Speaker Embedding from Text-to-Speech

This work jointly trains end-to-end Tacotron 2 TTS and speaker embedding networks in a self-supervised fashion, and hypothesizes that the embeddings will contain minimal phonetic information, since the TTS decoder obtains that information from the textual input.

Multi-Speaker End-to-End Speech Synthesis

It is demonstrated that the multi-speaker ClariNet outperforms state-of-the-art systems in terms of naturalness, because the whole model is jointly optimized in an end-to-end manner.

Semi-Supervised Speaker Adaptation for End-to-End Speech Synthesis with Pretrained Models

The proposed method can greatly simplify a speaker adaptation pipeline by consistently employing end-to-end ASR/TTS ecosystems, and achieves performance comparable to a paired-data adaptation method in terms of subjective speaker similarity and objective cepstral distance measures.