Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

@article{Shen2018NaturalTS,
  title={Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions},
  author={Jonathan Shen and Ruoming Pang and Ron J. Weiss and Mike Schuster and Navdeep Jaitly and Zongheng Yang and Zhifeng Chen and Yu Zhang and Yuxuan Wang and R. J. Skerry-Ryan and Rif A. Saurous and Yannis Agiomyrgiannakis and Yonghui Wu},
  journal={2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2018},
  pages={4779-4783}
}
  • Published 16 December 2017
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. [...] We further show that using this compact acoustic intermediate representation allows for a significant reduction in the size of the WaveNet architecture.
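At a high level, the abstract describes a two-stage pipeline: a sequence-to-sequence network predicts mel spectrogram frames from characters, and a separately trained WaveNet vocoder conditioned on those frames generates the waveform. The PyTorch sketch below shows only the shape of that pipeline; the MelPredictor module, its layer sizes, and the mean-pooled stand-in for attention are illustrative assumptions, not the paper's actual architecture.

```python
# A minimal sketch of a Tacotron 2-style two-stage pipeline.
# Layer sizes and the attention stand-in are assumptions, not the paper's.
import torch
import torch.nn as nn

class MelPredictor(nn.Module):
    """Stage 1: characters -> mel spectrogram frames (greatly simplified)."""

    def __init__(self, vocab_size=64, emb=256, hidden=512, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.encoder = nn.LSTM(emb, hidden // 2, bidirectional=True, batch_first=True)
        self.decoder = nn.LSTM(n_mels + hidden, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, chars, n_frames):
        enc, _ = self.encoder(self.embed(chars))   # (B, T_text, hidden)
        ctx = enc.mean(dim=1)                      # crude stand-in for attention
        frame = torch.zeros(chars.size(0), 1, self.proj.out_features)  # go-frame
        state, frames = None, []
        for _ in range(n_frames):                  # autoregressive frame-by-frame decoding
            out, state = self.decoder(torch.cat([frame, ctx[:, None, :]], dim=-1), state)
            frame = self.proj(out)
            frames.append(frame)
        return torch.cat(frames, dim=1)            # (B, n_frames, n_mels)

mels = MelPredictor()(torch.randint(0, 64, (1, 20)), n_frames=50)
print(mels.shape)  # torch.Size([1, 50, 80])
```

Stage 2, a vocoder such as WaveNet (see the references below), would consume these mel frames as conditioning and emit raw audio samples.

Citations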
Parallel WaveNet conditioned on VAE latent vectors
TLDR
The use of a sentence-level conditioning vector to improve the signal quality of a Parallel WaveNet neural vocoder is investigated, using the latent vector from a pre-trained VAE component of a Tacotron 2-style sequence-to-sequence model.
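The mechanism the TLDR describes reduces, at its simplest, to broadcasting one utterance-level latent vector along the time axis and concatenating it to the vocoder's frame-level conditioning features. A minimal sketch, with all tensor sizes assumed:

```python
# Broadcast a single sentence-level VAE latent across all conditioning frames.
# Sizes (80 mel bins, 16-dim latent, 400 frames) are illustrative assumptions.
import torch

mel = torch.randn(1, 400, 80)    # frame-level conditioning features
z = torch.randn(1, 16)           # one latent vector for the whole sentence
cond = torch.cat([mel, z[:, None, :].expand(-1, mel.size(1), -1)], dim=-1)
print(cond.shape)                # torch.Size([1, 400, 96]) -> fed to the vocoder
```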
GELP: GAN-Excited Linear Prediction for Speech Synthesis from Mel-spectrogram
TLDR
This paper proposes an alternative training strategy for a parallel neural vocoder utilizing generative adversarial networks, integrates a linear predictive synthesis filter into the model, and shows that the proposed model achieves a significant improvement in inference speed while outperforming WaveNet in copy-synthesis quality.
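The linear-predictive part of GELP can be pictured as an all-pole synthesis filter 1/A(z) driven by a generated excitation signal. The sketch below shows only that filtering step, with toy filter coefficients and random noise standing in for the GAN-generated excitation:

```python
# All-pole linear-predictive synthesis: speech = 1/A(z) applied to an excitation.
# The coefficients and the noise excitation are toy stand-ins, not GELP's outputs.
import numpy as np
from scipy.signal import lfilter

excitation = np.random.randn(16000)     # stand-in for one second of generated excitation
a = np.array([1.0, -1.3, 0.49])         # toy LP polynomial A(z), stable (poles at |z| = 0.7)
speech = lfilter([1.0], a, excitation)  # filter 1/A(z)
print(speech.shape)                     # (16000,)
```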
Prosody generation for text-to-speech synthesis
TLDR
This work proposes a template-based approach to automatic F0 generation, where per-syllable pitch-contour templates are predicted by a recurrent neural network (RNN); this mitigates the over-smoothing problem and is able to reproduce pitch patterns observed in the data.
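One way to realize the template idea the TLDR describes is to have the RNN emit, per syllable, mixture weights over a small bank of contour templates; the syllable's F0 track is then the weighted combination. The sketch below assumes exactly that formulation, with an invented template bank and feature sizes:

```python
# Per-syllable F0 from weighted pitch-contour templates (one assumed formulation).
import torch
import torch.nn as nn

n_templates, frames_per_syl = 8, 20
templates = torch.randn(n_templates, frames_per_syl)  # bank of contour shapes (would be learned)

rnn = nn.GRU(input_size=32, hidden_size=64, batch_first=True)
head = nn.Linear(64, n_templates)

syl_feats = torch.randn(1, 10, 32)            # 10 syllables, 32 linguistic features each
h, _ = rnn(syl_feats)
weights = torch.softmax(head(h), dim=-1)      # (1, 10, n_templates) per-syllable weights
f0 = weights @ templates                      # (1, 10, frames_per_syl) contour per syllable
print(f0.shape)
```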
Modelling Intonation in Spectrograms for Neural Vocoder based Text-to-Speech
TLDR
Compared to the original model, the spectrogram extension gives better mean opinion scores in subjective listening tests, and the intonation in the generated spectrograms is shown to match the intonation represented by the generated pitch curves.
Multi-speaker TTS with Deep Learning
Recent advancements in technology have allowed for great development in the field of speech synthesis. As such, present-day speech synthesis applications are expected to function for multiple voices, […]
The Tencent speech synthesis system for Blizzard Challenge 2020
This paper presents the Tencent speech synthesis system for Blizzard Challenge 2020. The corpus released to the participants this year is about 8 hours of speech data from an internet talk show by […]
Lombard Speech Synthesis Using Transfer Learning in a Tacotron Text-to-Speech System
TLDR
The subjective and objective evaluation results indicated that the proposed adaptation system, coupled with the WaveNet vocoder, clearly outperformed the conventional deep neural network based TTS system in the synthesis of Lombard speech.
The Duke Entry for 2020 Blizzard Challenge
This paper presents the speech synthesis system built for the 2020 Blizzard Challenge by team ‘H’. The goal of the challenge is to build a synthesizer that is able to generate high-fidelity speech […]
FastSpeech: Fast, Robust and Controllable Text to Speech
TLDR
A novel feed-forward network based on Transformer is proposed to generate mel-spectrograms in parallel for TTS; it speeds up mel-spectrogram generation by 270x and end-to-end speech synthesis by 38x, and is called FastSpeech. Neural network-based end-to-end text-to-speech (TTS) has significantly improved the quality of synthesized speech; prominent methods (e.g., Tacotron 2) usually first generate mel-spectrograms from […]
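What makes FastSpeech's mel generation parallel is its length regulator: each phoneme encoding is repeated according to a predicted duration, and the expanded sequence is mapped to mel frames in a single feed-forward pass rather than frame by frame. A minimal sketch of just that expansion step, with invented sizes and hard-coded durations:

```python
# FastSpeech-style length regulation: expand phoneme encodings by duration,
# then a feed-forward stack (not shown) maps them to mel frames in parallel.
import torch

def length_regulate(hidden, durations):
    # hidden: (T_phonemes, dim); durations: (T_phonemes,) integer frame counts
    return torch.repeat_interleave(hidden, durations, dim=0)

hidden = torch.randn(4, 256)             # 4 phoneme encodings (sizes assumed)
durations = torch.tensor([3, 5, 2, 6])   # predicted frames per phoneme (hard-coded here)
print(length_regulate(hidden, durations).shape)  # torch.Size([16, 256]) -> 16 mel frames at once
```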

References

Showing 1-10 of 31 references.
WaveNet: A Generative Model for Raw Audio
TLDR
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
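The core of WaveNet is a stack of dilated causal convolutions: each layer doubles its dilation, so the receptive field grows exponentially while every output sample depends only on past samples. A minimal sketch of that stack (channel counts and depth are assumptions; the gated activations, residual connections, and output distribution are omitted):

```python
# Stack of dilated causal 1-D convolutions, the backbone of WaveNet.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv(nn.Module):
    def __init__(self, channels, dilation):
        super().__init__()
        self.dilation = dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size=2, dilation=dilation)

    def forward(self, x):
        # Left-pad by the dilation so no output position sees future samples.
        return self.conv(F.pad(x, (self.dilation, 0)))

stack = nn.Sequential(*[CausalConv(64, 2 ** i) for i in range(8)])  # receptive field 2^8
x = torch.randn(1, 64, 16000)   # one second at 16 kHz, 64 channels (assumed)
print(stack(x).shape)           # torch.Size([1, 64, 16000]) -- length preserved
```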
Tacotron: Towards End-to-End Speech Synthesis
TLDR
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
Parallel WaveNet: Fast High-Fidelity Speech Synthesis
The recently developed WaveNet architecture is the current state of the art in realistic speech synthesis, consistently rated as more natural sounding for many different languages than any previous […]
Deep Voice: Real-time Neural Text-to-Speech
TLDR
Deep Voice lays the groundwork for truly end-to-end neural speech synthesis, shows that inference with the system can be performed faster than real time, and describes optimized WaveNet inference kernels on both CPU and GPU that achieve up to 400x speedups over existing implementations.
Unit selection in a concatenative speech synthesis system using a large speech database
  • Andrew J. Hunt, Alan W. Black
  • 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings
TLDR
It is proposed that the units in a synthesis database can be considered as a state transition network in which the state occupancy cost is the distance between a database unit and a target, and the transition cost is an estimate of the quality of concatenation of two consecutive units.
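The TLDR's state-transition-network view maps directly onto dynamic programming: pick, for each target position, the candidate unit minimizing accumulated target ("occupancy") cost plus join ("transition") cost. A toy Viterbi search over made-up feature vectors:

```python
# Viterbi unit selection: minimize summed target costs + join costs over the lattice.
import numpy as np

def select_units(targets, candidates, join_weight=1.0):
    """targets: (T, d) desired features; candidates: list of T arrays, each (n_t, d)."""
    cost = [np.linalg.norm(candidates[0] - targets[0], axis=1)]  # target cost of first unit
    back = []
    for t in range(1, len(targets)):
        target_cost = np.linalg.norm(candidates[t] - targets[t], axis=1)
        # Join cost between every previous candidate and every current candidate.
        join = np.linalg.norm(candidates[t - 1][:, None, :] - candidates[t][None, :, :], axis=2)
        total = cost[-1][:, None] + join_weight * join           # (n_prev, n_cur)
        back.append(total.argmin(axis=0))                        # best predecessor per unit
        cost.append(total.min(axis=0) + target_cost)
    path = [int(cost[-1].argmin())]                              # backtrace the cheapest path
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    return path[::-1]

rng = np.random.default_rng(0)
targets = rng.normal(size=(5, 3))                 # 5 target specs, 3 toy features each
candidates = [rng.normal(size=(4, 3)) for _ in targets]
print(select_units(targets, candidates))          # one chosen unit index per target
```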
Statistical parametric speech synthesis using deep neural networks
TLDR
This paper examines an alternative scheme based on a deep neural network (DNN): the relationship between input texts and their acoustic realizations is modeled by a DNN, and experimental results show that the DNN-based systems outperformed the HMM-based systems with similar numbers of parameters.
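The scheme the TLDR summarizes is, structurally, a plain feed-forward regression from per-frame linguistic features to per-frame acoustic features. A sketch with assumed feature counts:

```python
# Feed-forward DNN mapping linguistic features to acoustic features, frame by frame.
import torch
import torch.nn as nn

dnn = nn.Sequential(                 # all sizes are assumptions for illustration
    nn.Linear(300, 1024), nn.Tanh(), # 300 linguistic features per frame
    nn.Linear(1024, 1024), nn.Tanh(),
    nn.Linear(1024, 127),            # 127 acoustic features per frame (e.g. spectra + F0)
)
frames = torch.randn(500, 300)       # one utterance: 500 frames
print(dnn(frames).shape)             # torch.Size([500, 127])
```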
Speaker-Dependent WaveNet Vocoder
TLDR
A speaker-dependent WaveNet vocoder is proposed: a method of synthesizing speech waveforms with WaveNet, utilizing acoustic features from an existing vocoder as auxiliary features of WaveNet.
Attention-Based Models for Speech Recognition
TLDR
The attention mechanism is extended with features needed for speech recognition, and a novel and generic method of adding location-awareness to the attention mechanism is proposed to alleviate the issue of high phoneme error rate.
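Location-awareness here means the attention score at each encoder position also sees the previous step's attention weights, passed through a convolution, which encourages the alignment to move forward monotonically. A sketch of that scoring rule (dimensions assumed):

```python
# Location-sensitive attention: scores depend on content AND previous weights.
import torch
import torch.nn as nn

class LocationAttention(nn.Module):
    def __init__(self, dim=128, conv_channels=32, kernel=31):
        super().__init__()
        self.loc_conv = nn.Conv1d(1, conv_channels, kernel, padding=kernel // 2)
        self.loc_proj = nn.Linear(conv_channels, dim)
        self.query_proj = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, query, memory, prev_weights):
        # query: (B, dim) decoder state; memory: (B, T, dim); prev_weights: (B, T)
        loc = self.loc_proj(self.loc_conv(prev_weights[:, None, :]).transpose(1, 2))
        e = self.score(torch.tanh(memory + self.query_proj(query)[:, None, :] + loc))
        weights = torch.softmax(e.squeeze(-1), dim=1)
        context = torch.bmm(weights[:, None, :], memory)[:, 0]   # weighted sum of memory
        return context, weights

att = LocationAttention()
ctx, w = att(torch.randn(2, 128), torch.randn(2, 40, 128),
             torch.softmax(torch.randn(2, 40), dim=1))
print(ctx.shape, w.shape)  # torch.Size([2, 128]) torch.Size([2, 40])
```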
Char2Wav: End-to-End Speech Synthesis
TLDR
Char2Wav is an end-to-end model for speech synthesis that learns to produce audio directly from text; it is a bidirectional recurrent neural network with attention that produces vocoder acoustic features.
Deep Voice 3: 2000-Speaker Neural Text-to-Speech
TLDR
Deep Voice 3 is presented, a fully-convolutional attention-based neural text-to-speech (TTS) system that matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster.