A study of speaker adaptation for DNN-based speech synthesis

@inproceedings{Wu2015ASO,
  title={A study of speaker adaptation for DNN-based speech synthesis},
  author={Zhizheng Wu and Pawel Swietojanski and Christophe Veaux and Steve Renals and Simon King},
  booktitle={INTERSPEECH},
  year={2015}
}
A major advantage of statistical parametric speech synthesis (SPSS) over unit-selection speech synthesis is its adaptability and controllability in changing speaker characteristics and speaking style. Recently, several studies using deep neural networks (DNNs) as acoustic models for SPSS have shown promising results. However, the adaptability of DNNs in SPSS has not been systematically studied. In this paper, we conduct an experimental analysis of speaker adaptation for DNN-based speech… 

Citations

On the training of DNN-based average voice model for speech synthesis
  • Shan Yang, Zhizheng Wu, Lei Xie
  • Computer Science
    2016 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA)
  • 2016
TLDR
This work performs a systematic analysis of the training of the multi-speaker average voice model (AVM), which is the foundation of the adaptability and controllability of a DNN-based speech synthesis system.
Speaker Adaptation of Various Components in Deep Neural Network based Speech Synthesis
TLDR
This paper proposes a speech synthesis system based on multiple DNNs, in which several components are represented by feed-forward DNNs, and investigates the effectiveness of speaker adaptation for the essential components of DNN-based speech synthesis, including acoustic models, acoustic feature extraction, and post-filters.
Adapting and controlling DNN-based speech synthesis using input codes
TLDR
Experimental results show that high-performance multi-speaker models can be constructed using the proposed code vectors with a variety of encoding schemes, and that adaptation and manipulation can be performed effectively using the codes.
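The mechanism summarized here, conditioning a single multi-speaker network on a code that identifies the speaker, can be sketched roughly as follows. This is only an illustration of the general idea, not the architecture from the paper; the layer sizes, the learned embedding, and the class name are assumptions made for the example.
```python
# Illustrative sketch (PyTorch): a speaker code is concatenated with the linguistic
# features, so one network covers many speakers and can be adapted or manipulated
# by changing only the code. All dimensions below are placeholders.
import torch
import torch.nn as nn

class CodeConditionedAcousticModel(nn.Module):
    def __init__(self, linguistic_dim=355, num_speakers=10, code_dim=32, acoustic_dim=187):
        super().__init__()
        self.speaker_code = nn.Embedding(num_speakers, code_dim)  # learned code per speaker
        self.net = nn.Sequential(
            nn.Linear(linguistic_dim + code_dim, 1024), nn.Tanh(),
            nn.Linear(1024, 1024), nn.Tanh(),
            nn.Linear(1024, acoustic_dim),  # e.g. spectral, F0 and aperiodicity parameters
        )

    def forward(self, linguistic_feats, speaker_id):
        code = self.speaker_code(speaker_id)             # (batch, code_dim)
        x = torch.cat([linguistic_feats, code], dim=-1)  # condition every frame on the code
        return self.net(x)

model = CodeConditionedAcousticModel()
frames = torch.randn(8, 355)                       # dummy linguistic features for 8 frames
speaker = torch.full((8,), 3, dtype=torch.long)    # all frames belong to speaker 3
acoustic = model(frames, speaker)                  # (8, 187) predicted acoustic features
```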
Speaker Adaptation of Acoustic Model using a Few Utterances in DNN-based Speech Synthesis Systems
TLDR
This paper presents a novel technique to estimate a speaker-specific model using a partial copy of the speaker-independent model, by creating a separate parallel branch stemming from an intermediate hidden layer of the base network.
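Under one reading of this description, the speaker-specific model is a trainable copy of the upper layers of the frozen speaker-independent network, attached at an intermediate hidden layer. The sketch below follows that reading; the branch point, layer sizes and class name are illustrative assumptions, not the paper's exact design.
```python
import copy
import torch
import torch.nn as nn

class BranchAdaptedModel(nn.Module):
    """Rough sketch: keep the speaker-independent (SI) network frozen and graft a
    speaker-specific branch, a partial copy of its upper layers, onto one of its
    intermediate hidden layers. Only the branch is trained on adaptation data."""
    def __init__(self, si_model: nn.Sequential, branch_from: int):
        super().__init__()
        self.trunk = si_model[:branch_from]        # shared SI layers up to the branch point
        self.si_head = si_model[branch_from:]      # original SI upper layers (kept frozen)
        self.branch = copy.deepcopy(self.si_head)  # speaker-specific copy, stays trainable
        for p in list(self.trunk.parameters()) + list(self.si_head.parameters()):
            p.requires_grad = False

    def forward(self, x, use_branch=True):
        h = self.trunk(x)
        return self.branch(h) if use_branch else self.si_head(h)

# Hypothetical usage: branch off after the second hidden layer of a small SI model.
si = nn.Sequential(nn.Linear(355, 1024), nn.Tanh(),
                   nn.Linear(1024, 1024), nn.Tanh(),
                   nn.Linear(1024, 187))
adapted = BranchAdaptedModel(si, branch_from=4)
out = adapted(torch.randn(8, 355))   # (8, 187)
```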
Speaker Adaptation for Speech Synthesis Based on Deep Neural Networks Using Hidden Semi-Markov Model Structures
TLDR
The proposed speaker adaptation technique employs hidden semi-Markov model (HSMM) structures through a special type of mixture density network (MDN), called MDN-HSMM, which outputs the parameters of HSMMs; applying it improves the naturalness and speaker similarity of the synthesized speech.
Linear Networks Based Speaker Adaptation for Speech Synthesis
TLDR
Objective measurements and subjective tests show that LN with LRPD decomposition is most stable when adaptation data are extremely limited, and that the best speaker adaptation (SA) model, with only 200 adaptation utterances, achieves quality comparable to a speaker-dependent (SD) model trained with 1000 utterances, in both naturalness and similarity to the target speaker.
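Reading "LN with LRPD" as a speaker-dependent linear network whose transform is decomposed into a diagonal term plus a low-rank correction (an assumption on our part, not a detail confirmed by this listing), a minimal sketch of such a layer could look as follows; the hidden size and rank are made up.
```python
import torch
import torch.nn as nn

class LRPDLinearNetwork(nn.Module):
    """Sketch of a speaker-dependent linear transform inserted between frozen layers,
    parameterised as diagonal plus low-rank (our reading of 'LRPD') so that very
    little adaptation data is needed. Sizes are illustrative only."""
    def __init__(self, dim=1024, rank=8):
        super().__init__()
        self.diag = nn.Parameter(torch.ones(dim))      # initialised to the identity transform
        self.u = nn.Parameter(torch.zeros(dim, rank))  # low-rank correction U @ V^T
        self.v = nn.Parameter(torch.zeros(dim, rank))

    def forward(self, h):
        # (diag + U V^T) applied to each hidden vector h
        return h * self.diag + (h @ self.v) @ self.u.t()
```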
Speaker Representations for Speaker Adaptation in Multiple Speakers' BLSTM-RNN-Based Speech Synthesis
TLDR
Experimental results show that speaker representations fed to the first layer of the acoustic model can effectively control speaker identity during speaker-adaptive training, thereby improving the quality of synthesized speech for speakers included in the training phase.
Unsupervised Speaker Adaptation for DNN-based Speech Synthesis using Input Codes
TLDR
The proposed speaker-adaptation technique makes it possible to rapidly construct a voice for the target speaker in DNN-based speech synthesis, and it is expected that feeding the estimated speaker-similarity vector into the multi-speaker speech-synthesis model can generate synthetic speech that resembles the target speaker's voice.
A Unified Speaker Adaptation Method for Speech Synthesis using Transcribed and Untranscribed Speech with Backpropagation
TLDR
Surprisingly, adaptation with untranscribed speech surpassed the transcribed counterpart in the subjective test, which reveals the limitations of the conventional acoustic model and hints at potential directions for improvements.
An Investigation of DNN-Based Speech Synthesis Using Speaker Codes
TLDR
Experimental results showed that the proposed model outperformed the conventional speaker-dependent DNN when the model architecture was chosen to be optimal for the amount of training data available for the selected target speaker.

References

SHOWING 1-10 OF 33 REFERENCES
Preliminary Work on Speaker Adaptation for DNN-Based Speech Synthesis
TLDR
This work focuses on the exploitation of auxiliary information such as gender, speaker identity or age during the DNN training process and suggests that the proposed method is superior to standard feature transformations.
Analysis of Speaker Adaptation Algorithms for HMM-Based Speech Synthesis and a Constrained SMAPLR Adaptation Algorithm
TLDR
A new adaptation algorithm is proposed called constrained structural maximum a posteriori linear regression (CSMAPLR) whose derivation is based on the knowledge obtained in this analysis and on the results of comparing several conventional adaptation algorithms.
I-vector-based speaker adaptation of deep neural networks for French broadcast audio transcription
TLDR
This paper shows how i-vector-based speaker adaptation can be used to perform blind speaker adaptation of a hybrid DNN-HMM speech recognition system, and reports excellent results on a French-language audio transcription task.
Rapid and effective speaker adaptation of convolutional neural network based models for speech recognition
TLDR
Experimental results on the TIMIT dataset demonstrate that both methods are effective for adapting CNN-based acoustic models, and that even better performance can be achieved by combining the two.
Speaker adaptation of neural network acoustic models using i-vectors
TLDR
This work proposes to adapt deep neural network acoustic models to a target speaker by supplying speaker identity vectors (i-vectors) to the network as input features, in parallel with the regular acoustic features for ASR; the adapted models are comparable in performance to DNNs trained on speaker-adapted features, with the advantage that only one decoding pass is needed.
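The underlying mechanism, appending a fixed per-speaker i-vector to every input frame of the DNN, is simple enough to sketch. The snippet below is a generic illustration rather than the authors' recipe; i-vector extraction is assumed to be done elsewhere, and all dimensions are placeholders.
```python
import torch
import torch.nn as nn

# Sketch: a fixed i-vector for the current speaker is concatenated with every spliced
# acoustic frame, so the DNN sees "who is speaking" alongside the acoustics.
acoustic_dim, ivector_dim, num_states = 440, 100, 2000   # placeholder sizes

dnn = nn.Sequential(
    nn.Linear(acoustic_dim + ivector_dim, 2048), nn.Sigmoid(),
    nn.Linear(2048, 2048), nn.Sigmoid(),
    nn.Linear(2048, num_states),   # senone posteriors for the hybrid DNN-HMM
)

frames = torch.randn(16, acoustic_dim)              # 16 spliced acoustic frames
ivec = torch.randn(1, ivector_dim).expand(16, -1)   # same i-vector for every frame
logits = dnn(torch.cat([frames, ivec], dim=-1))     # (16, num_states)
```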
On the training aspects of Deep Neural Network (DNN) for parametric TTS synthesis
TLDR
Experimental results show that the DNN can outperform the conventional HMM, which is trained with maximum likelihood (ML) first and then refined by minimum generation error (MGE) training; both objective and subjective measures indicate that the DNN synthesizes speech better than the HMM-based baseline.
Adaptation of deep neural network acoustic models using factorised i-vectors
TLDR
The i-vectors are viewed as the weights of a cluster adaptive training (CAT) system, where the underlying models are GMMs rather than HMMs, which allows the factorisation approaches developed for CAT to be directly applied.
Deep mixture density networks for acoustic modeling in statistical parametric speech synthesis
  • H. Zen, A. Senior
  • Computer Science
    2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2014
TLDR
Experimental results in objective and subjective evaluations show that the use of the mixture density output layer improves the prediction accuracy of acoustic features and the naturalness of the synthesized speech.
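As a rough illustration of what a mixture density output layer predicts, the sketch below returns mixture weights, means and log standard deviations per frame and computes the corresponding negative log-likelihood; the dimensions, mixture count and helper names are hypothetical, not taken from the paper.
```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixtureDensityHead(nn.Module):
    """Sketch: instead of a single point estimate, predict the weights, means and
    (log) standard deviations of a Gaussian mixture over the acoustic features."""
    def __init__(self, hidden_dim=1024, out_dim=187, num_mix=4):
        super().__init__()
        self.num_mix, self.out_dim = num_mix, out_dim
        self.proj = nn.Linear(hidden_dim, num_mix * (1 + 2 * out_dim))

    def forward(self, h):
        p = self.proj(h)
        logit_w, mu, log_sigma = torch.split(
            p, [self.num_mix, self.num_mix * self.out_dim, self.num_mix * self.out_dim], dim=-1)
        return (F.log_softmax(logit_w, dim=-1),
                mu.view(-1, self.num_mix, self.out_dim),
                log_sigma.view(-1, self.num_mix, self.out_dim))

def mdn_nll(log_w, mu, log_sigma, target):
    """Negative log-likelihood of a target frame under the predicted diagonal mixture."""
    target = target.unsqueeze(1)                                  # (batch, 1, out_dim)
    log_prob = -0.5 * (((target - mu) / log_sigma.exp()) ** 2
                       + 2 * log_sigma + math.log(2 * math.pi)).sum(-1)
    return -torch.logsumexp(log_w + log_prob, dim=-1).mean()
```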
Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models
This paper proposes a simple yet effective model-based neural network speaker adaptation technique that learns speaker-specific hidden unit contributions given adaptation data, without requiring any form of speaker-adaptive training or labelled adaptation data.
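The idea as described is to learn a small speaker-specific re-scaling of hidden unit outputs while the rest of the network stays fixed. A minimal sketch, assuming the commonly used 2·sigmoid parameterisation of the scaling (our assumption, not a detail given here):
```python
import torch
import torch.nn as nn

class LHUCScaling(nn.Module):
    """Sketch of learning hidden unit contributions (LHUC): a per-speaker vector r
    rescales each hidden unit's output, here as 2*sigmoid(r) so the scale lies in
    (0, 2) and starts near 1. Only r is updated on the adaptation data."""
    def __init__(self, hidden_dim=1024):
        super().__init__()
        self.r = nn.Parameter(torch.zeros(hidden_dim))   # one vector per target speaker

    def forward(self, h):
        return h * (2.0 * torch.sigmoid(self.r))

# Usage idea: insert one LHUCScaling after each hidden layer of the frozen
# speaker-independent network and backpropagate only into the r vectors.
```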
Statistical parametric speech synthesis using deep neural networks
TLDR
This paper examines an alternative scheme based on a deep neural network (DNN), in which the relationship between input texts and their acoustic realizations is modeled by the DNN; experimental results show that the DNN-based systems outperformed the HMM-based systems with similar numbers of parameters.