Discrete acoustic space for an efficient sampling in neural text-to-speech

  Marek Střelec and Jonas Rohnke and Antonio Bonafonte and Mateusz Lajszczak and Trevor Wood
We present SVQ-VAE, an architecture that uses a split vector quantizer for neural text-to-speech (NTTS), as an enhancement to the well-known VAE and VQ-VAE architectures. Compared to these earlier architectures, the proposed model retains the benefits of an utterance-level bottleneck while reducing the associated loss of representation power. We train the model on recordings in the highly expressive task-oriented dialogues domain and show that SVQ-VAE achieves a statistically significant improvement in naturalness…
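
To make the idea of a split vector quantizer concrete, the following is a minimal sketch of generic split (product-style) quantization, not the authors' implementation. It assumes the utterance-level vector is cut into equal-width parts and that each part is snapped to the nearest entry of its own codebook. The names `nearest_code` and `split_quantize`, the sizes `DIM`, `NUM_SPLITS` and `CODES`, and the random codebooks are all hypothetical; in a trained model the codebooks would be learned.

```python
import random


def nearest_code(sub_vector, codebook):
    """Return the index of the codebook entry closest to sub_vector (squared L2)."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(sub_vector, codebook[i]))


def split_quantize(vector, codebooks):
    """Split `vector` into len(codebooks) equal parts and quantize each part
    independently against its own codebook.

    Returns (indices, quantized): one code index per split, and the
    concatenation of the selected codebook entries.
    """
    num_splits = len(codebooks)
    if len(vector) % num_splits != 0:
        raise ValueError("vector length must be divisible by the number of splits")
    width = len(vector) // num_splits
    indices, quantized = [], []
    for s, codebook in enumerate(codebooks):
        part = vector[s * width:(s + 1) * width]
        idx = nearest_code(part, codebook)
        indices.append(idx)
        quantized.extend(codebook[idx])
    return indices, quantized


# Hypothetical sizes chosen only for illustration.
DIM, NUM_SPLITS, CODES = 8, 4, 16
rng = random.Random(0)
# Random codebooks stand in for learned ones (assumption).
codebooks = [
    [[rng.gauss(0, 1) for _ in range(DIM // NUM_SPLITS)] for _ in range(CODES)]
    for _ in range(NUM_SPLITS)
]
utterance_embedding = [rng.gauss(0, 1) for _ in range(DIM)]
indices, quantized = split_quantize(utterance_embedding, codebooks)
```

With these illustrative sizes the quantizer stores only `NUM_SPLITS * CODES = 64` sub-vectors yet can express `CODES ** NUM_SPLITS = 65536` distinct combinations, whereas a single codebook with 64 entries could express only 64. This is one plausible reading of how splitting could reduce the loss of representation power while keeping the latent space discrete.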

