Corpus ID: 239768725

Discrete acoustic space for an efficient sampling in neural text-to-speech

Marek Střelec, Jonas Rohnke, Antonio Bonafonte, Mateusz Lajszczak, Trevor Wood
We present an SVQ-VAE architecture using a split vector quantizer for NTTS, as an enhancement to the well-known VAE and VQ-VAE architectures. Compared to these previous architectures, our proposed model retains the benefits of using an utterance-level bottleneck while reducing the associated loss of representation power. We train the model on recordings in the highly expressive task-oriented dialogue domain and show that SVQ-VAE achieves a statistically significant improvement in naturalness…
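The idea of a split vector quantizer can be illustrated with a minimal sketch: the latent vector is split into groups, and each group is quantized against its own small codebook, so the product of per-group codebooks represents far more points than a single codebook of the same size while keeping a discrete utterance-level bottleneck. This is not the authors' implementation; the class name, parameters, and random codebook initialisation below are illustrative assumptions.

```python
import numpy as np

class SplitVectorQuantizer:
    """Illustrative split vector quantizer: the latent vector is divided
    into num_splits groups, each quantized with its own codebook."""

    def __init__(self, dim, num_splits, codebook_size, seed=0):
        assert dim % num_splits == 0
        self.num_splits = num_splits
        self.sub_dim = dim // num_splits
        rng = np.random.default_rng(seed)
        # One codebook per group: shape (num_splits, codebook_size, sub_dim)
        self.codebooks = rng.normal(
            size=(num_splits, codebook_size, self.sub_dim))

    def quantize(self, z):
        """Map a latent vector z of shape (dim,) to its nearest code
        in each group; return the quantized vector and code indices."""
        groups = z.reshape(self.num_splits, self.sub_dim)
        out, indices = [], []
        for g, book in zip(groups, self.codebooks):
            d = np.sum((book - g) ** 2, axis=1)  # squared distance to codes
            k = int(np.argmin(d))
            indices.append(k)
            out.append(book[k])
        return np.concatenate(out), indices

vq = SplitVectorQuantizer(dim=8, num_splits=4, codebook_size=16)
z = np.random.default_rng(1).normal(size=8)
z_q, codes = vq.quantize(z)
print(z_q.shape, len(codes))  # (8,) 4
```

With 4 groups of 16 codes each, the quantizer can represent 16⁴ = 65,536 distinct latent points while storing only 64 code vectors, which is the intuition behind reducing the representation-power loss of a single-codebook VQ-VAE.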



Prosodic Representation Learning and Contextual Sampling for Neural Text-to-Speech
  • S. Karlapati, Ammar Abbas, +4 authors Thomas Drugman
  • Computer Science, Engineering
    ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
  • 2021
Kathaka is a model trained with a novel two-stage process for neural speech synthesis with contextually appropriate prosody; the paper also presents a novel method to sample from this learnt prosodic distribution using the contextual information available in the text.
A learned conditional prior for the VAE acoustic space of a TTS system
A novel method to compute an informative prior for the VAE latent space of a neural text-to-speech (TTS) system is proposed, which aims to sample with more prosodic variability while gaining controllability over the latent space’s structure.
Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize waveforms from those spectrograms.
Expressive Speech Synthesis via Modeling Expressions with Variational Autoencoder
Experiments show the VAE helps VoiceLoop to generate higher quality speech and to control the expressions in its synthesized speech by incorporating global characteristics into the speech generating process.
Using generative modelling to produce varied intonation for speech synthesis
This work uses variational autoencoders (VAEs), which explicitly place the most "average" data close to the mean of the Gaussian prior, and proposes that by moving towards the tails of the prior distribution the model will transition towards generating more idiosyncratic, varied renditions.
Learning Latent Representations for Style Control and Transfer in End-to-end Speech Synthesis
The Variational Autoencoder (VAE) is introduced to an end-to-end speech synthesis model, to learn the latent representation of speaking styles in an unsupervised manner and shows good properties such as disentangling, scaling, and combination.
In Other News: a Bi-style Text-to-speech Model for Synthesizing Newscaster Voice with Limited Data
This paper proposes a bi-style model that can synthesise both neutral-style and newscaster-style speech via a one-hot vector that factorises the two styles, proposes conditioning the model on contextual word embeddings, and extensively evaluates it against neutral NTTS and neutral concatenative-based synthesis.
Universal Neural Vocoding with Parallel WaveNet
It is shown that the proposed universal vocoder significantly outperforms speaker-dependent vocoders overall and performs better than several existing neural vocoder architectures in terms of naturalness and universality when tested on more than 300 open-source voices.
Towards conversational speech synthesis; lessons learned from the expressive speech processing project
This paper shows that because variation in voice quality plays a significant part in the transmission of interpersonal and affect-related social information, this feature should be given priority in future speech synthesis research.
Neural Discrete Representation Learning
Pairing these representations with an autoregressive prior, the model can generate high-quality images, videos, and speech, as well as performing high-quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.