Corpus ID: 240231556

VRAIN-UPV MLLP's system for the Blizzard Challenge 2021

@article{Martos2021VRAINUPVMS,
  title={VRAIN-UPV MLLP's system for the Blizzard Challenge 2021},
  author={Alejandro P{\'e}rez Gonz{\'a}lez de Martos and Alberto Sanch{\'i}s and Alfons Juan-C{\'i}scar},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.15792}
}
This paper presents the VRAIN-UPV MLLP's speech synthesis system for the SH1 task of the Blizzard Challenge 2021. The SH1 task consisted of building a Spanish text-to-speech system trained on (but not limited to) the corpus released by the Blizzard Challenge 2021 organization. It included 5 hours of studio-quality recordings from a native Spanish female speaker. In our case, this dataset was solely used to build a two-stage neural text-to-speech pipeline composed of a non-autoregressive acoustic…
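
The abstract is truncated, but the two-stage design it names (a non-autoregressive acoustic model that predicts a mel spectrogram, followed by a neural vocoder that turns the spectrogram into a waveform) can be sketched as below. Every class name, layer choice, and dimension is an illustrative assumption, not the authors' code.

```python
# A minimal sketch of a two-stage neural TTS pipeline, assuming PyTorch.
# Stage 1 predicts a mel spectrogram from phonemes in parallel; stage 2
# is a separate vocoder network. A real acoustic model would also expand
# phonemes to frame rate with a duration model (see the sketches below).
import torch
import torch.nn as nn

class AcousticModel(nn.Module):
    """Stage 1: phoneme IDs -> mel spectrogram, all frames in parallel."""
    def __init__(self, n_phonemes=100, d_model=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, d_model)
        self.encoder = nn.Conv1d(d_model, d_model, kernel_size=5, padding=2)
        self.to_mel = nn.Linear(d_model, n_mels)

    def forward(self, phoneme_ids):                  # (batch, time)
        x = self.embed(phoneme_ids)                  # (batch, time, d_model)
        x = self.encoder(x.transpose(1, 2)).transpose(1, 2)
        return self.to_mel(x)                        # (batch, time, n_mels)

class Vocoder(nn.Module):
    """Stage 2: mel spectrogram -> waveform (stand-in for a neural vocoder)."""
    def __init__(self, n_mels=80, hop_length=256):
        super().__init__()
        self.upsample = nn.ConvTranspose1d(n_mels, 1, kernel_size=hop_length,
                                           stride=hop_length)

    def forward(self, mel):                          # (batch, time, n_mels)
        return self.upsample(mel.transpose(1, 2))    # (batch, 1, samples)

phonemes = torch.randint(0, 100, (1, 20))
audio = Vocoder()(AcousticModel()(phonemes))         # (1, 1, 5120)
```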

Citations

MLLP-VRAIN UPV systems for the IWSLT 2022 Simultaneous Speech Translation and Speech-to-Speech Translation tasks

This work describes the participation of the MLLP-VRAIN research group in the two shared tasks of the IWSLT 2022 conference: Simultaneous Speech Translation and Speech-to-Speech Translation.

Low-Resource Multilingual and Zero-Shot Multispeaker TTS

Using the language-agnostic meta-learning (LAML) procedure and modifications to a TTS encoder, it is shown that a system can learn to speak a new language using just 5 minutes of training data while retaining the ability to infer the voice of even unseen speakers in the newly learned language.

Requirements and Motivations of Low-Resource Speech Synthesis for Language Revitalization

By building speech synthesis systems for three Indigenous languages spoken in Canada, Kanien’kéha, Gitksan & SENĆOŦEN, this paper re-evaluates the question of how much data is required to build low-resource speech synthesis systems featuring state-of-the-art neural models.

Prosody Cloning in Zero-Shot Multispeaker Text-to-Speech

It is shown that the voice of a speaker and the prosody of a spoken reference can be cloned independently, without any degradation in quality and with high similarity to both the original voice and prosody, as objective evaluation and a human study show.
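
As a rough illustration of cloning voice and prosody independently, the sketch below conditions a TTS encoder on per-frame pitch and energy taken from one reference and a speaker embedding taken from another. The interfaces and dimensions are assumptions for illustration, not the paper's model.

```python
# Illustrative sketch of conditioning a TTS encoder on prosody and speaker
# identity from independent sources. The interfaces are assumptions, not
# the paper's model.
import torch
import torch.nn as nn

class ProsodyCloningConditioner(nn.Module):
    def __init__(self, d_model=256, d_speaker=64):
        super().__init__()
        self.pitch_proj = nn.Linear(1, d_model)    # per-frame pitch
        self.energy_proj = nn.Linear(1, d_model)   # per-frame energy
        self.speaker_proj = nn.Linear(d_speaker, d_model)

    def forward(self, encoder_out, pitch, energy, speaker_embedding):
        # encoder_out: (batch, time, d_model); pitch, energy: (batch, time, 1)
        # Prosody comes from one reference utterance...
        x = encoder_out + self.pitch_proj(pitch) + self.energy_proj(energy)
        # ...while speaker identity is added independently from another.
        return x + self.speaker_proj(speaker_embedding).unsqueeze(1)

enc = torch.randn(1, 30, 256)
out = ProsodyCloningConditioner()(enc, torch.randn(1, 30, 1),
                                  torch.randn(1, 30, 1), torch.randn(1, 64))
```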

References

Non-Attentive Tacotron: Robust and Controllable Neural TTS Synthesis Including Unsupervised Duration Modeling

Non-Attentive Tacotron is presented, replacing the attention mechanism with an explicit duration predictor, which improves robustness significantly as measured by unaligned duration ratio and word deletion rate, two metrics introduced in this paper for large-scale robustness evaluation using a pre-trained speech recognition model.
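
A hedged sketch of the core idea, assuming PyTorch: a small convolutional duration predictor replaces attention, and encoder states are repeated according to the predicted durations. The paper itself pairs this with smoother Gaussian upsampling; the hard repetition below is a simplification, and all names and sizes are assumptions.

```python
# Hedged sketch of replacing attention with an explicit duration predictor.
# The paper pairs this with Gaussian upsampling; the hard repetition below
# is a simplification, and all names and sizes are assumptions.
import torch
import torch.nn as nn

class DurationPredictor(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(d_model, d_model, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(d_model, 1, kernel_size=1),
        )

    def forward(self, encoder_out):               # (batch, time, d_model)
        # Predict a non-negative number of output frames per input token.
        log_dur = self.net(encoder_out.transpose(1, 2)).squeeze(1)
        return torch.clamp(torch.exp(log_dur).round().long(), min=1)

def upsample_by_duration(encoder_out, durations):
    # Repeat each encoder state durations[i] times along the time axis.
    return torch.repeat_interleave(encoder_out, durations, dim=1)

enc = torch.randn(1, 5, 256)
dur = DurationPredictor()(enc)                    # (1, 5) frame counts
frames = upsample_by_duration(enc, dur[0])        # (1, sum(dur), 256)
```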

Tacotron: Towards End-to-End Speech Synthesis

Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms.

DurIAN: Duration Informed Attention Network for Speech Synthesis

It is shown that the proposed DurIAN system can generate highly natural speech that is on par with current state-of-the-art end-to-end systems, while being robust and stable at the same time.

SpeedySpeech: Efficient Neural Speech Synthesis

It is shown that self-attention layers are not necessary for the generation of high-quality audio, and a student-teacher network capable of high-quality, faster-than-real-time spectrogram synthesis is proposed, with low requirements on computational resources and fast training time.
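
A minimal sketch of that claim: a stack of residual dilated 1-D convolutions can provide a wide receptive field for spectrogram synthesis without any attention. Layer sizes and the exact block structure are illustrative assumptions.

```python
# Sketch of a fully convolutional encoder in the spirit of SpeedySpeech:
# residual dilated convolutions instead of self-attention. Sizes assumed.
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    def __init__(self, channels=128, dilation=1):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3,
                              dilation=dilation, padding=dilation)
        self.norm = nn.BatchNorm1d(channels)

    def forward(self, x):                       # (batch, channels, time)
        return x + torch.relu(self.norm(self.conv(x)))

# Growing dilations widen the receptive field without attention.
encoder = nn.Sequential(*[ResidualConvBlock(dilation=d) for d in (1, 2, 4, 8)])
spec = encoder(torch.randn(1, 128, 200))        # (1, 128, 200)
```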

Neural Speech Synthesis with Transformer Network

This paper introduces and adapts the multi-head attention mechanism to replace both the RNN structures and the original attention mechanism in Tacotron 2, and achieves state-of-the-art performance and quality close to human speech.
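
The sketch below shows the structural move the summary describes: a recurrent encoder is replaced by multi-head self-attention, so all positions are processed in parallel. It uses torch.nn.MultiheadAttention as a stand-in and is not the paper's implementation.

```python
# Minimal self-attention encoder layer standing in for an RNN encoder.
import torch
import torch.nn as nn

class SelfAttentionEncoderLayer(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                 # (batch, time, d_model)
        # Each position attends to every other position in parallel,
        # removing the sequential bottleneck of a recurrent encoder.
        attn_out, _ = self.attn(x, x, x)
        return self.norm(x + attn_out)

x = torch.randn(2, 50, 256)
y = SelfAttentionEncoderLayer()(x)       # same shape, (2, 50, 256)
```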

Robust Sequence-to-Sequence Acoustic Modeling with Stepwise Monotonic Attention for Neural TTS

The experimental results show that the proposed stepwise monotonic attention method achieves significant improvements in robustness in out-of-domain scenarios for phoneme-based models, without any regression on the in-domain naturalness test.
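
At inference time, stepwise monotonic attention constrains the alignment so that each decoder step either stays on the current encoder position or advances by exactly one, ruling out skipped or repeated words. The toy decode below illustrates only that hard constraint; the probability model is a simplified stand-in.

```python
# Toy illustration of the stepwise monotonic constraint at inference time.
import torch

def stepwise_monotonic_decode(p_move):
    """p_move[t]: probability of advancing the alignment at decoder step t."""
    position, alignment = 0, []
    for p in p_move:
        if p > 0.5:            # hard decision: advance by exactly one
            position += 1
        alignment.append(position)
    return alignment

p_move = torch.tensor([0.1, 0.8, 0.3, 0.9, 0.9])
print(stepwise_monotonic_decode(p_move))   # [0, 1, 1, 2, 3]
```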

Parallel Tacotron: Non-Autoregressive and Controllable TTS

Isaac Elias, H. Zen, Yonghui Wu · ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) · 2021
Parallel Tacotron, a non-autoregressive neural text-to-speech model augmented with a variational autoencoder-based residual encoder, is presented; it is highly parallelizable during both training and inference, allowing efficient synthesis on modern parallel hardware.
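
A sketch of the variational residual encoder idea, assuming PyTorch: the target spectrogram is encoded into a latent capturing residual prosody, sampled with the reparameterization trick and regularized by a KL term. Module names and dimensions are assumptions for illustration.

```python
# Hedged sketch of a VAE-style residual encoder. Names and sizes assumed.
import torch
import torch.nn as nn

class ResidualEncoder(nn.Module):
    def __init__(self, n_mels=80, d_latent=16):
        super().__init__()
        self.proj = nn.Linear(n_mels, 2 * d_latent)

    def forward(self, mel):                    # (batch, frames, n_mels)
        # Pool over time, then predict the posterior mean and log-variance.
        stats = self.proj(mel.mean(dim=1))
        mu, logvar = stats.chunk(2, dim=-1)
        # Reparameterization trick keeps sampling differentiable.
        z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)
        # KL term regularizes the latent toward the standard normal prior.
        kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1)
        return z, kl

z, kl = ResidualEncoder()(torch.randn(2, 120, 80))   # z: (2, 16)
```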

FastSpeech: Fast, Robust and Controllable Text to Speech

A novel feed-forward network based on the Transformer, called FastSpeech, is proposed to generate mel-spectrograms in parallel for TTS, speeding up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x.
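
The "controllable" part follows from the explicit durations: scaling the predicted durations by a factor changes the speaking rate. A minimal sketch of that length-regulator idea, with illustrative names:

```python
# Hedged sketch of duration-based speed control in the FastSpeech style.
import torch

def length_regulate(encoder_out, durations, alpha=1.0):
    # alpha < 1.0 gives faster speech, alpha > 1.0 slower speech.
    scaled = torch.clamp((durations.float() * alpha).round().long(), min=1)
    return torch.repeat_interleave(encoder_out, scaled, dim=1)

enc = torch.randn(1, 4, 256)
dur = torch.tensor([2, 3, 1, 4])
normal = length_regulate(enc, dur)            # (1, 10, 256)
fast = length_regulate(enc, dur, alpha=0.5)   # roughly half as many frames
```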