Waveform-Based Speaker Representations for Speech Synthesis

@inproceedings{Wan2018WaveformBasedSR,
  title={Waveform-Based Speaker Representations for Speech Synthesis},
  author={Moquan Wan and Gilles Degottex and Mark John Francis Gales},
  booktitle={INTERSPEECH},
  year={2018}
}
Speaker adaptation is a key aspect of building a range of speech processing systems, for example personalised speech synthesis. For deep-learning-based approaches, the model parameters are hard to interpret, making speaker adaptation more challenging. One widely used method to address this problem is to extract a fixed-length vector as a speaker representation and use it as an additional input to the task-specific…
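As a sketch of the conditioning scheme the abstract describes, the snippet below concatenates a fixed-length speaker vector to every frame of the input features of a synthesis network. This is a minimal illustration under assumed names and sizes, not the paper's actual model; the PyTorch framing and all dimensions are assumptions.

import torch
import torch.nn as nn

class SpeakerConditionedModel(nn.Module):
    def __init__(self, ling_dim=300, spk_dim=64, hidden=256, out_dim=80):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(ling_dim + spk_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, out_dim),   # e.g. acoustic feature frames
        )

    def forward(self, ling, spk):
        # ling: (batch, frames, ling_dim); spk: (batch, spk_dim)
        # broadcast the speaker vector across all frames, then concatenate
        spk = spk.unsqueeze(1).expand(-1, ling.size(1), -1)
        return self.net(torch.cat([ling, spk], dim=-1))

model = SpeakerConditionedModel()
out = model(torch.randn(2, 100, 300), torch.randn(2, 64))
print(out.shape)   # torch.Size([2, 100, 80])

Because the speaker representation enters only as an extra input, adapting to a new speaker reduces to supplying (or estimating) a new vector, with the task-specific network left unchanged.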
Citations

Phoneme Dependent Speaker Embedding and Model Factorization for Multi-speaker Speech Synthesis and Adaptation
TLDR
Experimental results confirm the adaptability of the proposed speaker embedding and model factorization structure, and listening tests demonstrate that the proposed method achieves better adaptation performance than the baselines in terms of naturalness and speaker similarity.
The NLPR Speech Synthesis entry for Blizzard Challenge 2020
TLDR
The paper describes the NLPR speech synthesis system entry for Blizzard Challenge 2020; the whole system structure, the data pruning method, and the duration control are introduced and discussed.

References

Showing 1–10 of 18 references
Adaptation of deep neural network acoustic models using factorised i-vectors
TLDR
The i-vectors are viewed as the weights of a cluster adaptive training (CAT) system, where the underlying models are GMMs rather than HMMs, which allows the factorisation approaches developed for CAT to be directly applied.
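The CAT view summarized above amounts to treating the i-vector entries as interpolation weights over a set of cluster means. A minimal numeric sketch of that adapted-mean computation for a single Gaussian component; all dimensions and variable names here are illustrative assumptions, not the paper's notation:

import numpy as np

ivec_dim, feat_dim = 100, 40
M = np.random.randn(feat_dim, ivec_dim)   # cluster mean offsets (one column per cluster)
mu_0 = np.random.randn(feat_dim)          # speaker-independent mean
lam = np.random.randn(ivec_dim)           # i-vector, viewed as CAT weights

mu_s = mu_0 + M @ lam                     # speaker-adapted mean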
Front-End Factor Analysis For Speaker Verification
  • Florin Curelaru
  • Computer Science
    2018 International Conference on Communications (COMM)
  • 2018
TLDR
This paper investigates which configuration and which parameters lead to the best performance of an i-vector/PLDA-based speaker verification system, and presents some preliminary experiments in which the utterances in the CSTR VCTK corpus were used, in addition to utterances from MIT-MDSVC, for training the total variability covariance matrix and the underlying PLDA matrices.
A Log Domain Pulse Model for Parametric Speech Synthesis
TLDR
A new signal model is proposed that adopts a combination of speech components that are additive in the log domain, leading to a simple synthesizer without the need for ad hoc tuning of model parameters.
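The log-domain additivity mentioned in the summary can be stated compactly as follows; the component symbols are generic placeholders, not the paper's exact decomposition:

\log |S(\omega)| = \sum_k \log |C_k(\omega)| \quad\Longleftrightarrow\quad |S(\omega)| = \prod_k |C_k(\omega)|

That is, components that multiply in the spectral domain simply add in the log domain.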
Deep neural networks for small footprint text-dependent speaker verification
TLDR
Experimental results show that the DNN-based speaker verification system achieves good performance compared to a popular i-vector system on a small-footprint text-dependent speaker verification task, is more robust to additive noise, and outperforms the i-vector system at low false-rejection operating points.
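A hedged sketch of the d-vector idea this summary refers to: a frame-level DNN is trained to classify the training speakers, and at enrolment/test time the activations of the last hidden layer are averaged over the utterance to form the speaker representation, with speakers compared via cosine similarity. All sizes and names below are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F

class DVectorNet(nn.Module):
    def __init__(self, feat_dim=40, hidden=256, n_speakers=500):
        super().__init__()
        self.hidden = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.classifier = nn.Linear(hidden, n_speakers)  # used only in training

    def d_vector(self, frames):          # frames: (n_frames, feat_dim)
        h = self.hidden(frames)          # frame-level hidden activations
        return h.mean(dim=0)             # average -> utterance-level d-vector

net = DVectorNet()
enrol = net.d_vector(torch.randn(200, 40))
test = net.d_vector(torch.randn(150, 40))
print(F.cosine_similarity(enrol, test, dim=0))  # verification score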
WaveNet: A Generative Model for Raw Audio
TLDR
WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.
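A minimal sketch of WaveNet's central ingredient, a stack of dilated causal 1-D convolutions whose receptive field grows exponentially with depth; the gated activations, residual/skip connections, and softmax output of the full model are omitted, and all sizes are illustrative assumptions.

import torch
import torch.nn as nn

class CausalConv1d(nn.Conv1d):
    """1-D convolution that never looks at future samples."""
    def __init__(self, channels, kernel_size=2, dilation=1):
        pad = (kernel_size - 1) * dilation
        super().__init__(channels, channels, kernel_size,
                         padding=pad, dilation=dilation)
        self.trim = pad

    def forward(self, x):
        # symmetric padding + right-trim = left-only ("causal") padding
        return super().forward(x)[..., :-self.trim]

# doubling dilations: receptive field of 256 samples after 8 layers
stack = nn.Sequential(*[CausalConv1d(32, dilation=2 ** i) for i in range(8)])
x = torch.randn(1, 32, 16000)      # (batch, channels, raw audio samples)
print(stack(x).shape)              # torch.Size([1, 32, 16000])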
An HMM-based speech synthesis system applied to English
This paper describes an HMM-based speech synthesis system (HTS), in which the speech waveform is generated from HMMs themselves, and applies it to English speech synthesis using the general speech…
Char2Wav: End-to-End Speech Synthesis
TLDR
Char2Wav is an end-to-end model for speech synthesis that learns to produce audio directly from text; it combines a bidirectional recurrent neural network reader with attention, which produces vocoder acoustic features, with a neural vocoder.
Analysis of i-vector Length Normalization in Speaker Recognition Systems
TLDR
The proposed approach deals with the non-Gaussian behavior of i-vectors by performing a simple length normalization, which allows the use of probabilistic models with Gaussian assumptions that yield performance equivalent to that of more complicated systems based on heavy-tailed assumptions.
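Length normalization itself is a one-line operation: each i-vector is scaled to unit Euclidean norm (in practice typically after centering and whitening, which this sketch omits). The dimension below is an illustrative assumption.

import numpy as np

def length_normalize(w, eps=1e-10):
    """Project an i-vector onto the unit hypersphere."""
    return w / (np.linalg.norm(w) + eps)

w = np.random.randn(400)                     # a 400-dimensional i-vector
print(np.linalg.norm(length_normalize(w)))   # ~1.0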
Unsupervised feature learning for audio classification using convolutional deep belief networks
In recent years, deep learning approaches have gained significant interest as a way of building hierarchical representations from unlabeled data. However, to our knowledge, these deep learning…
The voice bank corpus: Design, collection and data analysis of a large regional accent speech database
  • C. Veaux, J. Yamagishi, S. King
  • Computer Science
    2013 International Conference Oriental COCOSDA held jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE)
  • 2013
TLDR
The motivation and the processes involved in the design and recording of the Voice Bank corpus, specifically designed for the creation of personalised synthetic voices for individuals with speech disorders, are described.