Statistical parametric speech synthesis using deep neural networks

@inproceedings{Zen2013StatisticalPS,
  title={Statistical parametric speech synthesis using deep neural networks},
  author={Heiga Zen and Andrew W. Senior and Mike Schuster},
  booktitle={2013 IEEE International Conference on Acoustics, Speech and Signal Processing},
  year={2013},
  pages={7962--7966}
}
  • Published 26 May 2013
Conventional approaches to statistical parametric speech synthesis typically use decision tree-clustered context-dependent hidden Markov models (HMMs) to represent probability densities of speech parameters given texts. Speech parameters are generated from the probability densities to maximize their output probabilities, then a speech waveform is reconstructed from the generated parameters. This approach is reasonably effective but has a couple of limitations, e.g. decision trees are… 
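The paper's core idea, replacing the decision-tree-clustered HMMs with a deep neural network that maps frame-level linguistic features directly to acoustic parameters, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the layer sizes, random weights, and feature dimensions are all placeholder assumptions (real systems use hundreds of linguistic features, several large hidden layers, and trained weights).

```python
import numpy as np

# Illustrative dimensions (assumptions, far smaller than real systems):
N_LINGUISTIC = 8   # per-frame linguistic features (phoneme identity, position, ...)
N_HIDDEN = 16      # hidden layer width
N_ACOUSTIC = 4     # acoustic parameters (e.g. mel-cepstrum, log F0, aperiodicity)

rng = np.random.default_rng(0)

# Randomly initialised weights stand in for trained parameters.
W1 = rng.standard_normal((N_LINGUISTIC, N_HIDDEN)) * 0.1
b1 = np.zeros(N_HIDDEN)
W2 = rng.standard_normal((N_HIDDEN, N_ACOUSTIC)) * 0.1
b2 = np.zeros(N_ACOUSTIC)

def dnn_acoustic_model(linguistic_frames):
    """Map a (T, N_LINGUISTIC) sequence of per-frame linguistic feature
    vectors to a (T, N_ACOUSTIC) sequence of acoustic parameters."""
    h = np.tanh(linguistic_frames @ W1 + b1)  # hidden layer
    return h @ W2 + b2                        # linear output layer

frames = rng.standard_normal((100, N_LINGUISTIC))  # 100 hypothetical frames
acoustic = dnn_acoustic_model(frames)
print(acoustic.shape)  # (100, 4)
```

In the paper's framing, this forward pass replaces the tree-clustered context-dependent density lookup; the generated acoustic parameters would then still be passed to a vocoder to reconstruct the waveform.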
Statistical parametric speech synthesis: from HMM to LSTM-RNN
TLDR
The progress of acoustic modeling in SPSS from the HMM to the LSTM-RNN is reviewed.
Pre-Training of DNN-Based Speech Synthesis Based on Bidirectional Conversion between Text and Speech
TLDR
Experimental results in a speech synthesis task show that pre-trained DNN-based systems using the proposed method outperformed randomly-initialized DNN-based systems, especially when the amount of training data is limited.
Constructing a Deep Neural Network Based Spectral Model for Statistical Speech Synthesis
TLDR
A statistical parametric speech synthesis system is proposed that models high-dimensional spectral amplitudes directly in the DNN framework to improve modelling of spectral fine structure; experimental results show that the proposed technique increases the quality of synthetic speech.
Temporal modeling in neural network based statistical parametric speech synthesis
TLDR
Subjective listening test results show that the proposed approach improves the naturalness of synthesized speech; variations of the proposed neural network structure are also discussed.
DNN-Based Duration Modeling for Synthesizing Short Sentences
TLDR
Experimental results of objective evaluations show that feedforward DNNs can outperform previous state-of-the-art solutions for duration modeling, predicting phone durations in Text-to-Speech (TTS) systems, especially for short sentences.
Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis
  • H. Zen, A. Senior
  • 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
TLDR
Experimental results in objective and subjective evaluations show that the use of the mixture density output layer improves the prediction accuracy of acoustic features and the naturalness of the synthesized speech.
The effect of neural networks in statistical parametric speech synthesis
TLDR
Experimental results show that the use of a DNN as the acoustic model is effective, and that parameter generation combined with a DNN improves the naturalness of synthesized speech.
Speech Synthesis Based on Hidden Markov Models and Deep Learning
TLDR
The results indicate that this approach can improve the spectral characteristics of HMM voices, but additional research is needed on other parameters of the voice signal, such as energy and fundamental frequency, to obtain more natural-sounding voices.
An Improved Speech Synthesis Algorithm with Post filter Parameters Based on Deep Neural Network
TLDR
A deep neural network is put forward to replace the clustering decision tree, and a post-filter-parameter-based speech synthesis improvement algorithm is proposed that enhances the formant regions of the synthesized speech spectrum by selecting the optimal filter parameter according to the flatness of the spectrum.
Statistical Parametric Speech Synthesis Using Bottleneck Representation From Sequence Auto-encoder
TLDR
This paper uses a recurrent neural network based auto-encoder to show that it is indeed possible to map units of varying duration to a single vector, and uses this unit-level acoustic representation to synthesize speech with a deep neural network based statistical parametric speech synthesis technique.

References

Showing 1-10 of 43 references
Deep Neural Networks for Acoustic Modeling in Speech Recognition
TLDR
This paper provides an overview of this progress and represents the shared views of four research groups who have had recent successes in using deep neural networks for acoustic modeling in speech recognition.
Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups
TLDR
This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
A Hidden Semi-Markov Model-Based Speech Synthesis System
TLDR
Subjective listening test results show that the use of HSMMs, which can be viewed as HMMs with explicit state duration PDFs, improves the reported naturalness of synthesized speech.
Adaptation of pitch and spectrum for HMM-based speech synthesis using MLLR
TLDR
It is demonstrated that a few sentences uttered by a target speaker are sufficient to adapt not only voice characteristics but also prosodic features, and synthetic speech generated from adapted models using only four sentences is very close to that from speaker-dependent models trained using 450 sentences.
Simultaneous modeling of spectrum, pitch and duration in HMM-based speech synthesis
TLDR
An HMM-based speech synthesis system in which spectrum, pitch and state duration are modeled simultaneously in a unified framework of HMM is described.
Context adaptive training with factorized decision trees for HMM-based statistical parametric speech synthesis
TLDR
Experiments on a word-level emphasis synthesis task show that all context adaptive training approaches can outperform the standard full-context-dependent HMM approach; the MLLR-based system achieved the best performance.
Acoustic modeling with contextual additive structure for HMM-based speech recognition
TLDR
Experimental results show that the proposed technique improves phoneme recognition accuracy with fewer distributions than conventional triphone HMMs.
Product of Experts for Statistical Parametric Speech Synthesis
TLDR
Experimental results show that the PoE framework provides both a mathematically elegant way to train multiple acoustic models jointly and significant improvements in the quality of the synthesized speech.
Speech parameter generation algorithms for HMM-based speech synthesis
This paper derives a speech parameter generation algorithm for HMM-based speech synthesis, in which the speech parameter sequence is generated from HMMs whose observation vector consists of a…
A Speech Parameter Generation Algorithm Considering Global Variance for HMM-Based Speech Synthesis
TLDR
A generation algorithm is proposed that considers not only the HMM likelihood maximized in the conventional algorithm but also a likelihood for the global variance of the generated trajectory, which works as a penalty against over-smoothing.