• Publications
Merlin: An Open Source Neural Network Speech Synthesis System
Merlin is a speech synthesis toolkit for neural network-based synthesis: it takes linguistic features as input, employs neural networks to predict acoustic features, and passes those to a vocoder to produce the speech waveform.
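The pipeline described above can be sketched in a few lines. This is a schematic illustration with made-up layer sizes and random weights, not the real Merlin API; a real system would train the network and hand the predicted frames to a vocoder such as WORLD.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_acoustic(linguistic, w1, w2):
    """One hidden tanh layer mapping linguistic to acoustic features, frame by frame."""
    hidden = np.tanh(linguistic @ w1)
    return hidden @ w2  # e.g. mel-cepstra, F0 and aperiodicity per frame

# Hypothetical dimensions: 425 linguistic inputs, 512 hidden units, 187 acoustic outputs.
n_frames, n_ling, n_hidden, n_acoustic = 100, 425, 512, 187
w1 = rng.standard_normal((n_ling, n_hidden)) * 0.01
w2 = rng.standard_normal((n_hidden, n_acoustic)) * 0.01

ling = rng.standard_normal((n_frames, n_ling))  # one linguistic vector per frame
acoustic = predict_acoustic(ling, w1, w2)       # (100, 187) acoustic feature frames
# A vocoder would then synthesise the waveform from these frames.
```

In Merlin the network is trained on aligned linguistic/acoustic feature pairs; the fixed random weights here only show the data flow.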
Deep neural networks employing Multi-Task Learning and stacked bottleneck features for speech synthesis
It is shown that the hidden representation used within a DNN can be improved through the use of Multi-Task Learning, and that stacking multiple frames of hidden layer activations (stacked bottleneck features) also leads to improvements.
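Stacking bottleneck features amounts to concatenating each frame's bottleneck-layer activations with those of its neighbouring frames. A minimal sketch, with an illustrative 32-dimensional bottleneck and a +/-4 frame context window (both values chosen arbitrarily, not taken from the paper):

```python
import numpy as np

def stack_frames(bottleneck, context=4):
    """Concatenate each frame with its +/- context neighbours (edge-padded)."""
    padded = np.pad(bottleneck, ((context, context), (0, 0)), mode="edge")
    n = len(bottleneck)
    return np.hstack([padded[i:i + n] for i in range(2 * context + 1)])

bn = np.random.default_rng(1).standard_normal((50, 32))  # 50 frames of bottleneck activations
stacked = stack_frames(bn, context=4)                    # (50, 32 * 9) = (50, 288)
```

The stacked vectors give the next network stage a wider temporal view than a single frame provides.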
Sentence-level control vectors for deep neural network speech synthesis
Results show that the global prosodic characteristics of synthetic speech can be controlled simply and robustly at run time by supplementing basic linguistic features with sentence-level control vectors which are novel but designed to be consistent with those observed in the training corpus.
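Supplementing the input in this way is just tiling one sentence-level vector across all frames and appending it to the per-frame linguistic features. A sketch with hypothetical dimensions and arbitrary control values:

```python
import numpy as np

def add_control(linguistic, control):
    """Tile a sentence-level control vector and append it to every frame's features."""
    tiled = np.tile(control, (len(linguistic), 1))
    return np.hstack([linguistic, tiled])

ling = np.zeros((30, 425))          # 30 frames of linguistic features (illustrative size)
ctrl = np.array([0.5, -0.2])        # hypothetical 2-d prosody control vector
augmented = add_control(ling, ctrl)  # (30, 427): same control values on every frame
```

At synthesis time, varying the control vector shifts the sentence's global prosody without changing the linguistic features themselves.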
Unsupervised and lightly-supervised learning for rapid construction of TTS systems in multiple languages from 'found' data: evaluation and analysis
Techniques are presented for building text-to-speech front-ends without language-specific expert knowledge, relying instead on universal resources and unsupervised learning from unannotated data to ease system development.
Thousands of Voices for HMM-Based Speech Synthesis–Analysis and Application of TTS Systems Built on Various ASR Corpora
This paper demonstrates thousands of voices for HMM-based speech synthesis built from several popular ASR corpora, such as the Wall Street Journal, Resource Management, Globalphone, and SPEECON databases.
Unsupervised learning for text-to-speech synthesis
The distributional analysis proposed here places the textual objects analysed in a continuous-valued space, rather than specifying a hard categorisation of those objects, so that the models generalise over objects’ surface forms in a way that is acoustically relevant.
The CSTR/EMIME HTS system for Blizzard Challenge 2010
This work was supported by the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement 213845 (the EMIME project).
HMM-based synthesis of child speech
This work compared six configurations of a statistical parametric synthesiser built with the HMM-based system HTS, using both speaker-dependent and speaker-adaptive modelling techniques and varying amounts of data.
From HMMs to DNNs: where do the improvements come from?
It is found that replacing decision trees with DNNs and moving from state-level to frame-level predictions both significantly improve listeners' naturalness ratings of synthetic speech produced by the systems.
Combining a vector space representation of linguistic context with a deep neural network for text-to-speech synthesis
This work proposes to combine previous work on vector-space representations of linguistic context, which have the added advantage of working directly from textual input, and Deep Neural Networks (DNNs), which can directly accept such continuous representations as input.
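The key point is that a continuous vector-space representation of linguistic context can be fed directly into a network, where a hard categorisation would require one-hot encoding. A toy sketch with a made-up vocabulary and an illustrative 16-dimensional embedding table (none of these values are from the paper):

```python
import numpy as np

rng = np.random.default_rng(2)
vocab = {"the": 0, "cat": 1, "sat": 2}
embeddings = rng.standard_normal((len(vocab), 16))  # continuous word representations

def context_input(words):
    """Concatenate the continuous representations of a window of context words."""
    return np.concatenate([embeddings[vocab[w]] for w in words])

x = context_input(["the", "cat", "sat"])  # 48-d continuous input vector for a DNN
```

Because nearby words get nearby vectors, the network can generalise over surface forms instead of treating each category as unrelated.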