Linear Networks Based Speaker Adaptation for Speech Synthesis

@inproceedings{Huang2018LinearNB,
  title={Linear Networks Based Speaker Adaptation for Speech Synthesis},
  author={Zhiying Huang and Heng Lu and Ming Lei and Zhijie Yan},
  booktitle={2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2018},
  pages={5319--5323}
}
  • Zhiying Huang, Heng Lu, Ming Lei, Zhijie Yan
  • Published 5 March 2018
  • Computer Science, Engineering
  • 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Speaker adaptation methods aim to create fair-quality synthetic voices for target speakers when only limited resources are available. Recently, as deep-neural-network-based statistical parametric speech synthesis (SPSS) methods have become dominant in SPSS TTS back-end modeling, speaker adaptation under the neural-network-based SPSS framework has also become an important task. In this paper, linear networks (LN) are inserted into multiple neural network layers and fine-tuned together with…
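The linear-network idea in the abstract can be illustrated in a few lines: an affine transform, initialized to identity, is inserted after a hidden layer of a frozen speaker-independent model, and only the inserted parameters would be fine-tuned on the target speaker's data. A minimal numpy sketch (shapes and initialization are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Speaker-independent model: one hidden layer, weights frozen at adaptation time.
W_h = rng.normal(scale=0.1, size=(16, 8))   # hidden-layer weights
W_o = rng.normal(scale=0.1, size=(8, 4))    # output-layer weights

# Linear network (LN) inserted after the hidden layer, initialized to identity
# so the adapted model starts out exactly equal to the speaker-independent one.
A = np.eye(8)     # trainable square transform
b = np.zeros(8)   # trainable bias

def forward(x, A, b):
    h = np.tanh(x @ W_h)   # frozen hidden layer
    h = h @ A + b          # speaker-specific linear network
    return h @ W_o         # frozen output layer

x = rng.normal(size=(2, 16))
base = np.tanh(x @ W_h) @ W_o
# At initialization the LN is a no-op; adaptation then updates only A and b.
assert np.allclose(forward(x, A, b), base)
```

Because only `A` and `b` are updated, the number of speaker-specific parameters stays small, which is what makes this style of adaptation practical with limited target-speaker data.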
Scaling and Bias Codes for Modeling Speaker-Adaptive DNN-Based Speech Synthesis Systems
This paper introduces scaling and bias codes as a generalized means of speaker-adaptive transformation and shows that the proposed method improves speaker adaptation over adaptation based on the conventional input code.
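The scaling-and-bias-code idea can be sketched as an element-wise transform h' = s ⊙ h + b, where the scale s and bias b are generated from a learned speaker code. The dimensions and the linear mapping from code to scale/bias below are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical sizes: a 4-dim speaker code generates a scale and a bias
# for an 8-unit hidden layer.
code_dim, hidden = 4, 8
V_scale = rng.normal(scale=0.1, size=(code_dim, hidden))
V_bias = rng.normal(scale=0.1, size=(code_dim, hidden))

def adapt(h, speaker_code):
    """Speaker-adaptive transform h' = s * h + b, with s and b
    produced from the speaker code (scale centred at 1)."""
    s = 1.0 + speaker_code @ V_scale
    b = speaker_code @ V_bias
    return s * h + b

h = rng.normal(size=(3, hidden))
# A zero code gives s = 1 and b = 0, leaving activations unchanged.
assert np.allclose(adapt(h, np.zeros(code_dim)), h)
```

Centring the scale at 1 means an all-zero code reduces to the unadapted network, a convenient starting point for new-speaker adaptation.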
Voice Conversion towards Arbitrary Speakers With Limited Data
A speaker-adaptive voice conversion (SAVC) system that accomplishes voice conversion towards arbitrary speakers with limited data is proposed, and two adaptive approaches are explored: adaptation of the whole MSVC model or of additional linear hidden layers (AHL).
A Unified Speaker Adaptation Method for Speech Synthesis using Transcribed and Untranscribed Speech with Backpropagation
Surprisingly, adaptation with untranscribed speech surpassed the transcribed counterpart in the subjective test, which reveals the limitations of the conventional acoustic model and hints at potential directions for improvement.
Cumulative Adaptation for BLSTM Acoustic Models
i-vectors were used as an input to the neural network to perform instantaneous speaker and environment adaptation, providing an 8% relative improvement in word error rate on the NIST Hub5 2000 evaluation test set.
An Evaluation of Postfiltering for Deep Learning Based Speech Synthesis with Limited Data
Results show that even when starting from as little as 5 minutes of speech recordings, postfiltering improves the quality of the synthetic speech output; it can therefore be used as a training strategy for TTS systems where sufficient high-quality data is not available.
Comparison of BLSTM-Layer-Specific Affine Transformations for Speaker Adaptation
Applying affine transformations is observed to yield consistent relative word error rate reductions ranging from 6% to 11%, depending on the task and the degree of mismatch between training and test data.
Speaking style adaptation in Text-To-Speech synthesis using Sequence-to-sequence models with attention
This study proposes a transfer-learning method to adapt a sequence-to-sequence TTS system from normal speaking style to Lombard style; results indicated that an adaptation system with the WaveNet vocoder clearly outperformed the conventional deep-neural-network-based TTS system in synthesizing Lombard speech.
The IOA-ThinkIT system for Blizzard Challenge 2019
This paper presents the IOA-ThinkIT team's text-to-speech system for Blizzard Challenge 2019. A statistical parametric speech synthesis based system was built with improvements in both front-end text…
A Survey on Neural Speech Synthesis
A comprehensive survey on neural TTS is conducted, aiming to provide a good understanding of current research and future trends, with a focus on the key components of neural TTS: text analysis, acoustic models, and vocoders.
Lombard Speech Synthesis Using Transfer Learning in a Tacotron Text-to-Speech System
Subjective and objective evaluation results indicated that the proposed adaptation system coupled with the WaveNet vocoder clearly outperformed the conventional deep-neural-network-based TTS system in the synthesis of Lombard speech.

References

Showing 1–10 of 24 references
A study of speaker adaptation for DNN-based speech synthesis
An experimental analysis of speaker adaptation for DNN-based speech synthesis at different levels, systematically analysing the performance of each individual adaptation technique and of their combinations.
Speaker Representations for Speaker Adaptation in Multiple Speakers' BLSTM-RNN-Based Speech Synthesis
Experimental results show that speaker representations input to the first layer of the acoustic model can effectively control speaker identity during speaker-adaptive training, improving the synthesized speech quality of speakers included in the training phase.
Adapting and controlling DNN-based speech synthesis using input codes
Experimental results show that high-performance multi-speaker models can be constructed using the proposed code vectors with a variety of encoding schemes, and that adaptation and manipulation can be performed effectively using the codes.
Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis
This paper proposes an approach to modeling multi-speaker TTS with a general DNN, in which the hidden layers are shared among speakers while the output layers are composed of speaker-dependent nodes targeting each speaker.
Robust Speaker-Adaptive HMM-Based Text-to-Speech Synthesis
A speaker-adaptive HMM-based speech synthesis system is described that employs speaker adaptation, feature-space adaptive training, mixed-gender modeling, and full-covariance modeling using CSMAPLR transforms, in addition to several other techniques that have proved effective in previous systems.
Learning hidden unit contributions for unsupervised speaker adaptation of neural network acoustic models
This paper proposes a simple yet effective model-based neural network speaker adaptation technique that learns speaker-specific hidden unit contributions given adaptation data, without requiring any…
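The learning-hidden-unit-contributions (LHUC) technique described here re-scales each hidden unit by a per-speaker amplitude, conventionally constrained to (0, 2) via a sigmoid, with the amplitude parameters being the only ones adapted. A minimal numpy sketch, with illustrative shapes:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lhuc(h, r):
    """Scale each hidden unit by an amplitude 2*sigmoid(r) in (0, 2);
    r holds the only speaker-specific (adapted) parameters."""
    return 2.0 * sigmoid(r) * h

h = np.ones((1, 4))
r = np.zeros(4)   # r = 0 gives amplitude 1, i.e. the unadapted network
assert np.allclose(lhuc(h, r), h)
```

Initializing `r` to zero makes the adapted network start identical to the speaker-independent one, mirroring the identity initialization used by linear-network adaptation.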
HSMM-Based Model Adaptation Algorithms for Average-Voice-Based Speech Synthesis
Several speaker adaptation algorithms and MAP modification are described to develop a consistent method for synthesizing speech in a unified way from an arbitrary amount of speech data.
Constrained structural maximum a posteriori linear regression for average-voice-based speech synthesis
The proposed constrained structural maximum a posteriori linear regression (CSMAPLR) algorithm is incorporated into an HSMM-based speech synthesis system; subjective evaluation shows that CSMAPLR adaptation produces synthetic speech more similar to the target speaker than CMLLR and SMAPLR adaptation.
Average-Voice-Based Speech Synthesis Using HSMM-Based Speaker Adaptation and Adaptive Training
This work utilizes the framework of the hidden semi-Markov model (HSMM), an HMM with explicit state duration distributions, and applies an HSMM-based model adaptation algorithm to simultaneously transform both the state output and state duration distributions.
Adaptation of pitch and spectrum for HMM-based speech synthesis using MLLR
It is demonstrated that a few sentences uttered by a target speaker are sufficient to adapt not only voice characteristics but also prosodic features; synthetic speech generated from models adapted using only four sentences is very close to that from speaker-dependent models trained on 450 sentences.
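The MLLR adaptation used in this line of HMM-based work applies a shared affine transform to the Gaussian means of a regression class, mu' = A mu + b, so a few target-speaker sentences suffice to move many Gaussians at once. A minimal numpy sketch with illustrative dimensions and a hand-picked transform (in practice A and b are estimated by maximum likelihood from the adaptation data):

```python
import numpy as np

rng = np.random.default_rng(2)
dim = 3

# Speaker-independent Gaussian means belonging to one regression class.
means = rng.normal(size=(5, dim))

# A single MLLR transform [A, b] shared by the whole class adapts
# every mean towards the target speaker: mu' = A @ mu + b.
A = np.eye(dim) * 1.1
b = np.full(dim, 0.2)

adapted = means @ A.T + b
assert adapted.shape == means.shape
```

Tying one transform across many Gaussians is what makes the method robust with very little adaptation data; with more data, the regression class tree can be split so that different classes get their own transforms.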