• Corpus ID: 30919574

Char2Wav: End-to-End Speech Synthesis

@inproceedings{Sotelo2017Char2WavES,
  title={Char2Wav: End-to-End Speech Synthesis},
  author={Jose M. R. Sotelo and Soroush Mehri and Kundan Kumar and Jo{\~a}o Felipe Santos and Kyle Kastner and Aaron C. Courville and Yoshua Bengio},
  booktitle={ICLR},
  year={2017}
}
We present Char2Wav, an end-to-end model for speech synthesis. [...]
Key Method
Neural vocoder refers to a conditional extension of SampleRNN which generates raw waveform samples from intermediate representations. Unlike traditional models for speech synthesis, Char2Wav learns to produce audio directly from text.
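The two-stage split in the abstract, a reader followed by a neural vocoder, can be pictured with a rough sketch. This is not the authors' code; it is a minimal PyTorch illustration in which every module name, layer choice, and dimension is hypothetical, standing in for the attention-based reader and the conditional SampleRNN vocoder described above.

```python
# Rough sketch of the Char2Wav design from the abstract: a reader
# (attention-based seq2seq, characters -> intermediate acoustic features)
# feeding a neural vocoder (conditional autoregressive model,
# features -> raw waveform samples). Illustrative only.
import torch
import torch.nn as nn

class Reader(nn.Module):
    """Characters -> intermediate acoustic features, via attention."""
    def __init__(self, n_chars, feat_dim, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_chars, hidden)
        self.encoder = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.enc_proj = nn.Linear(2 * hidden, hidden)
        self.decoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.attend = nn.MultiheadAttention(hidden, num_heads=1, batch_first=True)
        self.out = nn.Linear(2 * hidden, feat_dim)

    def forward(self, chars, prev_feats):
        enc, _ = self.encoder(self.embed(chars))    # (B, T_text, 2H)
        enc = self.enc_proj(enc)                    # (B, T_text, H)
        dec, _ = self.decoder(prev_feats)           # (B, T_feat, H)
        ctx, _ = self.attend(dec, enc, enc)         # decoder attends over text
        return self.out(torch.cat([dec, ctx], -1))  # predicted acoustic features

class NeuralVocoder(nn.Module):
    """Intermediate features -> categorical distribution over raw samples.
    Stand-in for the conditional SampleRNN extension; in the real model the
    vocoder runs at sample rate, with features upsampled accordingly."""
    def __init__(self, feat_dim, hidden=256, q_levels=256):
        super().__init__()
        self.rnn = nn.GRU(feat_dim + 1, hidden, batch_first=True)
        self.out = nn.Linear(hidden, q_levels)

    def forward(self, feats, prev_samples):
        x = torch.cat([feats, prev_samples.unsqueeze(-1)], -1)
        h, _ = self.rnn(x)
        return self.out(h)                          # logits per waveform sample

reader = Reader(n_chars=60, feat_dim=80)
feats = reader(torch.randint(0, 60, (1, 32)), torch.zeros(1, 100, 80))
logits = NeuralVocoder(feat_dim=80)(feats, torch.zeros(1, 100))
```

The point of the sketch is the division of labor: the reader handles alignment between text and acoustics, while the vocoder handles raw-sample generation.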
Conditional End-to-End Audio Transforms
TLDR
An end-to-end method for transforming audio from one style to another based on convolutional and hierarchical recurrent neural networks, designed to capture long-term acoustic dependencies, requires minimal post-processing, and produces realistic audio transforms.
Myanmar Text-to-Speech Synthesis Using End-to-End Model
TLDR
This paper proposes a Myanmar speech synthesis system based on an End-to-End neural network model, which integrates the Myanmar phone model into the Tacotron2 End-to-End model, and introduces the BERT pre-training decoder module to assist the phone feature extraction.
Tacotron: Towards End-to-End Speech Synthesis
TLDR
Tacotron is presented, an end-to-end generative text-to-speech model that synthesizes speech directly from characters and achieves a 3.82 subjective 5-scale mean opinion score on US English, outperforming a production parametric system in terms of naturalness.
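Tacotron predicts magnitude spectrogram frames directly from characters and reconstructs the waveform with the Griffin-Lim algorithm, so its final inversion step can be reproduced with a standard library call. A minimal sketch using librosa, with random stand-in data in place of real model output:

```python
# Griffin-Lim phase reconstruction: the waveform-synthesis step Tacotron
# applies to its predicted linear-scale magnitude spectrogram.
import numpy as np
import librosa

n_fft, hop = 2048, 256
# Stand-in for a model-predicted magnitude spectrogram: (1 + n_fft/2, frames).
mag = np.abs(np.random.randn(n_fft // 2 + 1, 400)).astype(np.float32)

wav = librosa.griffinlim(mag, n_iter=60, hop_length=hop, win_length=n_fft)
```

The paper additionally raises the predicted magnitudes to a small power (1.2) before inversion to reduce artifacts; that detail is omitted here.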
Multi-Speaker End-to-End Speech Synthesis
TLDR
It is demonstrated that the multi-speaker ClariNet outperforms state-of-the-art systems in terms of naturalness, because the whole model is jointly optimized in an end-to-end manner.
Parallel Neural Text-to-Speech
TLDR
This work proposes a non-autoregressive seq2seq model that converts text to spectrogram and builds the first fully parallel neural text-to-speech system by applying the inverse autoregressive flow (IAF) as the parallel neural vocoder.
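The reason an inverse autoregressive flow yields a parallel vocoder, as the summary states, is that the shift and scale applied to each noise variable depend only on earlier noise variables, never on earlier outputs, so every output sample can be computed at once. Schematically (notation mine, not the paper's):

```latex
% IAF sampling: the statistics depend on z_{<t} only, so all x_t are
% computable in parallel from a single draw of the noise sequence z.
x_t = z_t \cdot \sigma_t\!\left(z_{<t}\right) + \mu_t\!\left(z_{<t}\right),
\qquad z \sim \mathcal{N}(0, I)
```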
End-to-End Neural Speech Synthesis
In recent years, end-to-end neural networks have become the state of the art for speech recognition tasks and they are now widely deployed in industry (Amodei et al., 2016). Naturally, this has led [...]
Tacotron: A Fully End-to-End Text-To-Speech Synthesis Model
TLDR
This paper presents Tacotron, an end-to-end generative text-to-speech model that synthesizes speech directly from characters, and presents several key techniques to make the sequence-to-sequence framework perform well for this challenging task.
Lombard Speech Synthesis Using Transfer Learning in a Tacotron Text-to-Speech System
TLDR
The subjective and objective evaluation results indicated that the proposed adaptation system coupled with the WaveNet vocoder clearly outperformed the conventional deep neural network based TTS system in the synthesis of Lombard speech.
Neural Speech Synthesis with Transformer Network
TLDR
This paper introduces and adapts the multi-head attention mechanism to replace the RNN structures, as well as the original attention mechanism, in Tacotron2, and achieves state-of-the-art performance with quality close to human speech.
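The multi-head attention that replaces the RNN structures is the standard Transformer formulation of Vaswani et al. (2017):

```latex
\mathrm{Attention}(Q, K, V)
  = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V,
\qquad
\mathrm{MultiHead}(Q, K, V)
  = \mathrm{Concat}\!\left(\mathrm{head}_1, \ldots, \mathrm{head}_h\right) W^{O},
```
with $\mathrm{head}_i = \mathrm{Attention}\!\left(Q W_i^{Q}, K W_i^{K}, V W_i^{V}\right)$.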
Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning
We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training [...]

References

Showing 1-10 of 44 references
Merlin: An Open Source Neural Network Speech Synthesis System
TLDR
The Merlin toolkit for neural network-based speech synthesis takes linguistic features as input and employs neural networks to predict acoustic features, which are then passed to a vocoder to produce the speech waveform.
SampleRNN: An Unconditional End-to-End Neural Audio Generation Model
TLDR
It is shown that the model, which profits from combining memory-less modules (namely autoregressive multilayer perceptrons) with stateful recurrent neural networks in a hierarchical structure, is able to capture underlying sources of variation in temporal sequences over very long time spans, on three datasets of different kinds.
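The hierarchy the summary describes (stateful RNNs ticking once per frame, with a memory-less MLP predicting individual samples at the bottom) can be pictured with a toy two-tier sketch. This is illustrative PyTorch, not the reference implementation; tier sizes and wiring are simplified:

```python
# Toy sketch of SampleRNN's hierarchy: a slow frame-level RNN summarizes
# blocks of FRAME samples and conditions a memory-less sample-level MLP
# that predicts one quantized sample at a time. Illustrative only.
import torch
import torch.nn as nn

FRAME, H, Q = 16, 256, 256  # frame size, hidden width, quantization levels

frame_rnn  = nn.GRU(FRAME, H, batch_first=True)  # ticks once per frame
sample_mlp = nn.Sequential(nn.Linear(H + FRAME, H), nn.ReLU(), nn.Linear(H, Q))

samples = torch.randn(1, 8 * FRAME)              # fake waveform, 8 frames
frames  = samples.view(1, 8, FRAME)              # slow-tier input
ctx, _  = frame_rnn(frames)                      # (1, 8, H): one state per frame

# Bottom tier: previous FRAME samples plus the current frame's context.
prev   = samples[:, -FRAME:]                                # last frame of history
logits = sample_mlp(torch.cat([ctx[:, -1], prev], dim=-1))  # (1, Q) next-sample logits
```

The real model stacks several tiers at different clock rates and trains with truncated backpropagation through time, but the split between slow stateful tiers and a fast memory-less output stage is the core idea.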
Listen, attend and spell: A neural network for large vocabulary conversational speech recognition
We present Listen, Attend and Spell (LAS), a neural speech recognizer that transcribes speech utterances directly to characters without pronunciation models, HMMs or other components of traditional speech recognizers.
Towards End-To-End Speech Recognition with Recurrent Neural Networks
This paper presents a speech recognition system that directly transcribes audio data with text, without requiring an intermediate phonetic representation. The system is based on a combination of the deep bidirectional LSTM recurrent neural network architecture and the Connectionist Temporal Classification objective function.
Statistical parametric speech synthesis: from HMM to LSTM-RNN
TLDR
The progress of acoustic modeling in SPSS from the HMM to the LSTM-RNN is reviewed.
Multi-Language Multi-Speaker Acoustic Modeling for LSTM-RNN Based Statistical Parametric Speech Synthesis
  • Bo Li, H. Zen
  • Computer Science
    INTERSPEECH
  • 2016
TLDR
A long short-term memory (LSTM) recurrent neural network (RNN) based statistical parametric speech synthesis system is presented that uses data from multiple languages and speakers and can synthesize speech in multiple languages from a single model while maintaining naturalness.
WaveNet: A Generative Model for Raw Audio
TLDR
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
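Two details carry most of WaveNet's design: the autoregressive factorization over raw samples, and the gated activation applied to dilated causal convolutions (both from the paper):

```latex
p(\mathbf{x}) = \prod_{t=1}^{T} p\!\left(x_t \mid x_1, \ldots, x_{t-1}\right),
\qquad
\mathbf{z} = \tanh\!\left(W_{f,k} \ast \mathbf{x}\right)
             \odot \sigma\!\left(W_{g,k} \ast \mathbf{x}\right)
```
where $\ast$ denotes a dilated causal convolution, $\odot$ element-wise multiplication, and $k$ the layer index.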
Directly modeling voiced and unvoiced components in speech waveforms by neural networks
  • K. Tokuda, H. Zen
  • Computer Science
    2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2016
TLDR
A novel acoustic model based on neural networks for statistical parametric speech synthesis that can generate speech waveforms approximating natural speech waveforms.
Directly modeling speech waveforms by neural networks for statistical parametric speech synthesis
  • K. Tokuda, H. Zen
  • Computer Science
    2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2015
TLDR
Experimental results show that the proposed approach can directly maximize the likelihood defined in the waveform domain.
Unit selection in a concatenative speech synthesis system using a large speech database
  • Andrew J. Hunt, A. Black
  • Computer Science
    1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings
  • 1996
TLDR
It is proposed that the units in a synthesis database can be considered as a state transition network in which the state occupancy cost is the distance between a database unit and a target, and the transition cost is an estimate of the quality of concatenation of two consecutive units.
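Concretely, the state-transition-network view amounts to choosing, for targets $t_1 \ldots t_n$, the unit sequence $u_1 \ldots u_n$ that minimizes a sum of target (state-occupancy) costs and concatenation (transition) costs, which a Viterbi search solves efficiently:

```latex
C\!\left(t_1^{n}, u_1^{n}\right)
  = \sum_{i=1}^{n} C^{t}\!\left(t_i, u_i\right)
  + \sum_{i=2}^{n} C^{c}\!\left(u_{i-1}, u_i\right),
\qquad
\hat{u}_1^{n} = \operatorname*{arg\,min}_{u_1^{n}} C\!\left(t_1^{n}, u_1^{n}\right)
```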