• Corpus ID: 18649557

TTS synthesis with bidirectional LSTM based recurrent neural networks

@inproceedings{Fan2014TTSSW,
  title={TTS synthesis with bidirectional LSTM based recurrent neural networks},
  author={Yuchen Fan and Yao Qian and Feng-Long Xie and Frank K. Soong},
  booktitle={INTERSPEECH},
  year={2014}
}
Feed-forward Deep Neural Network (DNN)-based text-to-speech (TTS) systems have recently been shown to outperform decision-tree-clustered, context-dependent HMM TTS systems [1, 4]. Key Method: In this paper, Recurrent Neural Networks (RNNs) with Bidirectional Long Short Term Memory (BLSTM) cells are adopted to capture the correlation or co-occurrence information between any two instants in a speech utterance for parametric TTS synthesis. Experimental results show that a hybrid system of DNN and BLSTM-RNN, i…
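As background, the bidirectional recurrence the abstract describes can be sketched in plain NumPy: one pass runs left-to-right over the frame sequence and one runs right-to-left, and their hidden states are concatenated so each output frame sees both past and future context. This is an illustrative simplification with hypothetical names and sizes, using a vanilla tanh cell in place of the paper's LSTM cells, not the authors' implementation.

```python
import numpy as np

def birnn(x, W_f, U_f, W_b, U_b):
    """Toy bidirectional recurrence over a (T, d_in) frame sequence.

    Returns a (T, 2*d_h) array: the forward and backward hidden states
    concatenated per frame, so frame t is conditioned on both past and
    future context. A vanilla tanh cell stands in for LSTM cells.
    """
    T = x.shape[0]
    d_h = W_f.shape[0]
    h_fwd = np.zeros((T, d_h))
    h_bwd = np.zeros((T, d_h))
    h = np.zeros(d_h)
    for t in range(T):                     # left-to-right pass
        h = np.tanh(W_f @ h + U_f @ x[t])
        h_fwd[t] = h
    h = np.zeros(d_h)
    for t in reversed(range(T)):           # right-to-left pass
        h = np.tanh(W_b @ h + U_b @ x[t])
        h_bwd[t] = h
    return np.concatenate([h_fwd, h_bwd], axis=1)

# Hypothetical sizes: 10 frames of 5-dim linguistic features, 4 hidden units.
rng = np.random.default_rng(0)
x = rng.standard_normal((10, 5))
W_f, W_b = rng.standard_normal((2, 4, 4)) * 0.1
U_f, U_b = rng.standard_normal((2, 4, 5)) * 0.1
out = birnn(x, W_f, U_f, W_b, U_b)
print(out.shape)  # (10, 8)
```

In a real system each concatenated frame vector would feed a regression layer that predicts acoustic parameters for a vocoder; the paper's hybrid system additionally stacks feed-forward DNN layers below the BLSTM layers.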

Figures and Tables from this paper

Voice conversion using deep Bidirectional Long Short-Term Memory based Recurrent Neural Networks
TLDR
This paper proposes a sequence-based conversion method using DBLSTM-RNNs to model not only the frame-wise relationship between the source and the target voice, but also the long-range context-dependencies in the acoustic trajectory.
DNN-Based Duration Modeling for Synthesizing Short Sentences
TLDR
Experimental results of objective evaluations show that feedforward DNNs can outperform previous state-of-the-art solutions in duration modeling and in predicting phone durations in Text-to-Speech (TTS) systems, particularly for short sentences.
Deep Recurrent Neural Networks in Speech Synthesis Using a Continuous Vocoder
TLDR
Four neural network architectures are investigated and applied using this continuous vocoder to model F0, MVF, and Mel-Generalized Cepstrum for more natural sounding speech synthesis and show that the proposed framework converges faster and gives state-of-the-art speech synthesis performance while outperforming the conventional feed-forward DNN.
Deep Feed-Forward Sequential Memory Networks for Speech Synthesis
TLDR
Both objective and subjective experiments show that, compared with the BLSTM TTS method, the DFSMN system can generate synthesized speech with comparable quality while drastically reducing model complexity and speech generation time.
Deep Elman recurrent neural networks for statistical parametric speech synthesis
LSTM Deep Neural Networks Postfiltering for Enhancing Synthetic Voices
TLDR
An application of long short-term memory (LSTM) deep neural networks as a postfiltering step in HMM-based speech synthesis, to obtain characteristics which are closer to those of natural speech.
Bidirectional LSTM Networks Employing Stacked Bottleneck Features for Expressive Speech-Driven Head Motion Synthesis
TLDR
This work presents a novel approach which makes use of DNNs with stacked bottleneck features combined with a BLSTM architecture to model context and expressive variability, and results from a subjective evaluation show a significant improvement of the bottleneck architecture over feed-forward DNNs.
Improving Deep Neural Network Based Speech Synthesis through Contextual Feature Parametrization and Multi-Task Learning
TLDR
The bidirectional recurrent neural network architecture with long short term memory (BLSTM) units is adopted and trained with multi-task learning for DNN-based speech synthesis and the quality of synthesized speech has been improved.
Constructing long short-term memory based deep recurrent neural networks for large vocabulary speech recognition
  • Xiangang Li, Xihong Wu
  • Computer Science
    2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2015
TLDR
Alternative deep LSTM architectures are proposed and empirically evaluated on a large vocabulary conversational telephone speech recognition task, and experimental results demonstrate that the deep LSTM networks benefit from the depth and yield state-of-the-art performance on this task.

References

SHOWING 1-10 OF 24 REFERENCES
On the training aspects of Deep Neural Network (DNN) for parametric TTS synthesis
TLDR
Experimental results show that DNN can outperform the conventional HMM, which is trained with ML first and then refined by MGE, and both objective and subjective measures indicate that DNN can synthesize speech better than the HMM-based baseline.
Hybrid speech recognition with Deep Bidirectional LSTM
TLDR
The hybrid approach with DBLSTM appears to be well suited for tasks where acoustic modelling predominates, and the improvement in word error rate over the deep network is modest, despite a great increase in frame-level accuracy.
Speech recognition with deep recurrent neural networks
TLDR
This paper investigates deep recurrent neural networks, which combine the multiple levels of representation that have proved so effective in deep networks with the flexible use of long range context that empowers RNNs.
Feature engineering in Context-Dependent Deep Neural Networks for conversational speech transcription
TLDR
This work investigates the potential of Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, from a feature-engineering perspective to reduce the word error rate for speaker-independent transcription of phone calls.
Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis
TLDR
This work proposes to combine previous work on vector-space representations of linguistic context, which have the added advantage of working directly from textual input, and Deep Neural Networks (DNNs), which can directly accept such continuous representations as input.
Statistical parametric speech synthesis using deep neural networks
TLDR
This paper examines an alternative scheme based on a deep neural network (DNN), in which the relationship between input texts and their acoustic realizations is modeled by a DNN; experimental results show that the DNN-based systems outperformed the HMM-based systems with similar numbers of parameters.
Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups
TLDR
This article provides an overview of progress and represents the shared views of four research groups that have had recent successes in using DNNs for acoustic modeling in speech recognition.
Bidirectional recurrent neural networks
TLDR
It is shown how the proposed bidirectional structure can be easily modified to allow efficient estimation of the conditional posterior probability of complete symbol sequences without making any explicit assumption about the shape of the distribution.
Multi-distribution deep belief network for speech synthesis
TLDR
Compared with the predominant HMM-based approach, objective evaluation shows that the spectrum generated from DBN has less distortion, and the overall quality is comparable to that of context-independent HMM.
Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks
TLDR
This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems of sequence learning and post-processing.