• Publications
Merlin: An Open Source Neural Network Speech Synthesis System
Merlin, a toolkit for neural network-based speech synthesis, takes linguistic features as input and employs neural networks to predict acoustic features, which are then passed to a vocoder to produce the speech waveform.
ASVspoof 2015: the first automatic speaker verification spoofing and countermeasures challenge
The ASVspoof initiative aims to overcome the lack of standards through the provision of standard corpora, protocols and metrics to support a common evaluation of automatic speaker verification technology.
Spoofing and countermeasures for speaker verification: A survey
A survey of past work and priority research directions for the future is provided, showing that future research should address the lack of standard datasets and the over-fitting of existing countermeasures to specific, known spoofing attacks.
Deep neural networks employing Multi-Task Learning and stacked bottleneck features for speech synthesis
It is shown that the hidden representation used within a DNN can be improved through the use of Multi-Task Learning, and that stacking multiple frames of hidden layer activations (stacked bottleneck features) also leads to improvements.
Detecting Converted Speech and Natural Speech for anti-Spoofing Attack in Speaker Recognition
Experiments show that features derived from the phase spectrum substantially outperform mel-frequency cepstral coefficients (MFCCs): even without converted speech for training, the equal error rate (EER) is reduced from 20.20% with MFCCs to 2.35%.
A study of speaker adaptation for DNN-based speech synthesis
An experimental analysis of speaker adaptation for DNN-based speech synthesis at different levels is presented, systematically analysing the performance of each individual adaptation technique and of their combinations.
Synthetic speech detection using temporal modulation feature
The synthetic speech detection results show that modulation features provide complementary information to magnitude/phase features, and that the best detection performance is obtained by fusing phase modulation features and phase features, yielding an equal error rate.
Automatic prosody prediction and detection with Conditional Random Field (CRF) models
Experiments performed on the Boston University Radio Speech Corpus show that CRF models trained on the proposed rich contextual features can improve the accuracy of prosody prediction and detection in both speaker-dependent and speaker-independent cases.
Investigating gated recurrent networks for speech synthesis
  • Zhizheng Wu, S. King
  • Computer Science
    IEEE International Conference on Acoustics…
  • 11 January 2016
This work attempts to answer two questions: why do LSTMs work well as a sequence model for SPSS; and which component (e.g., input gate, output gate, forget gate) is most important.
The Voice Conversion Challenge 2016
The design of the challenge, its results, and a future plan to share views about unsolved problems and challenges faced by current VC techniques are summarized.