Improved Prosody from Learned F0 Codebook Representations for VQ-VAE Speech Waveform Reconstruction

Yi Zhao, Haoyu Li, Cheng-I Lai, Jennifer Williams, Erica Cooper, Junichi Yamagishi
Vector Quantized Variational Autoencoders (VQ-VAE) are a powerful representation-learning framework that can discover discrete groups of features from a speech signal without supervision. Until now, the VQ-VAE architecture has modeled individual types of speech features, such as only phones or only F0. This paper introduces an important extension to VQ-VAE for learning F0-related suprasegmental information simultaneously along with traditional phone features. The proposed framework…
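As a rough illustration of the quantization step at the heart of VQ-VAE (a minimal sketch, not the paper's implementation; the codebook values below are made up), each encoder output frame is replaced by its nearest codebook vector:

```python
import math

def quantize(frame, codebook):
    """Return the index and vector of the codebook entry nearest to
    `frame` under squared Euclidean distance (the VQ-VAE bottleneck)."""
    best_i, best_d = 0, math.inf
    for i, code in enumerate(codebook):
        d = sum((f - c) ** 2 for f, c in zip(frame, code))
        if d < best_d:
            best_i, best_d = i, d
    return best_i, codebook[best_i]

# Hypothetical 4-entry, 2-dimensional codebook for illustration.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
idx, z_q = quantize([0.9, 0.1], codebook)
print(idx, z_q)  # 1 [1.0, 0.0]
```

In a full VQ-VAE, gradients are typically passed through this non-differentiable lookup with a straight-through estimator, and separate codebooks (e.g. one for phones, one for F0, as in the paper's extension) can be attached to different encoder streams.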

Learning Disentangled Phone and Speaker Representations in a Semi-Supervised VQ-VAE Paradigm

A speaker encoder and a speaker VQ codebook are incorporated that learn global speaker characteristics entirely separately from the existing sub-phone codebooks; this improves objective measures of speech synthesis quality and yields learned representations that are meaningful.

Speech Resynthesis from Discrete Disentangled Self-Supervised Representations

To generate disentangled representations, low-bitrate representations are extracted for speech content, prosodic information, and speaker identity, which are then used to synthesize speech in a controllable manner from self-supervised discrete representations.

Vaw-Gan For Disentanglement And Recomposition Of Emotional Elements In Speech

This paper proposes a speaker-dependent EVC framework based on VAW-GAN that includes a spectral encoder, which disentangles emotion and prosody (F0) information from spectral features, and a prosodic encoder, which disentangles the emotion modulation of prosody from linguistic prosody.

Analysis of Voice Conversion and Code-Switching Synthesis Using VQ-VAE

This paper presents an analysis of speech synthesis quality achieved by simultaneously performing voice conversion and language code-switching using multilingual VQ-VAE speech synthesis in German…

Expressive TTS Training With Frame and Style Reconstruction Loss

This is the first study to incorporate utterance-level perceptual quality as a loss function in Tacotron training for improved expressiveness, marking a departure from the style-token paradigm.

End-to-End Text-to-Speech Using Latent Duration Based on VQ-VAE

A new TTS framework is proposed that incorporates duration as a discrete latent variable to TTS and enables joint optimization of whole modules from scratch and provides a theoretical explanation to justify the method.

Taming Visually Guided Sound Generation

This work proposes a single model capable of generating visually relevant, high-fidelity sounds prompted with a set of frames from open-domain videos in less time than it takes to play it on a single GPU.

Exploring Disentanglement with Multilingual and Monolingual VQ-VAE

This work examines the content and usefulness of disentangled phone and speaker representations from two separately trained VQ-VAE systems: one trained on multilingual data and another trained on monolingual data.

NSVQ: Noise Substitution in Vector Quantization for Machine Learning

This study proposes a vector quantization technique called NSVQ, which approximates vector quantization behavior by substituting multiplicative noise, so that it can be used for machine learning problems.
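A minimal sketch of the noise-substitution idea (based on a reading of NSVQ; details such as the noise distribution are assumptions here, not the paper's exact formulation): during training, the hard quantization output is replaced by the input plus random noise rescaled to the magnitude of the true quantization error, which keeps the operation differentiable:

```python
import math
import random

def nsvq(z, codebook, rng=None):
    """Simulated quantization: find the nearest codebook vector, then
    substitute the true quantization error (z - nearest) with Gaussian
    noise rescaled to the same norm, preserving differentiability."""
    rng = rng or random.Random(0)
    nearest = min(codebook,
                  key=lambda c: sum((a - b) ** 2 for a, b in zip(z, c)))
    err = math.sqrt(sum((a - b) ** 2 for a, b in zip(z, nearest)))
    v = [rng.gauss(0.0, 1.0) for _ in z]
    vnorm = math.sqrt(sum(x * x for x in v)) or 1.0
    # Same error magnitude as hard quantization, random direction.
    return [a + err * x / vnorm for a, x in zip(z, v)]

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = nsvq([0.9, 0.1], codebook)
```

The substituted output lies at the same distance from the input as the true quantized vector, so the decoder sees a realistic quantization error while gradients flow through unhindered.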

When Creative AI Meets Conversational AI

  • Xianchao Wu
  • Journal of Natural Language Processing
  • 2021
At the 27th Natural Language Processing conference (NLP 2021), the first workshop named “when creative AI meets conversational AI”, or briefly “CAI+CAI=CAI”, was proposed and organized.

Unsupervised Speech Representation Learning Using WaveNet Autoencoders

A regularization scheme is introduced that forces the representations to focus on the phonetic content of the utterance, and performance comparable to the top entries in the ZeroSpeech 2017 unsupervised acoustic unit discovery task is reported.

Group Latent Embedding for Vector Quantized Variational Autoencoder in Non-Parallel Voice Conversion

The proposed Group Latent Embedding for Vector Quantized Variational Autoencoders, used in non-parallel voice conversion, significantly improves the acoustic quality of the VC syntheses compared to the traditional VQ-VAE while retaining the voice identity of the target speaker.

Low Bit-rate Speech Coding with VQ-VAE and a WaveNet Decoder

This work demonstrates that a neural network architecture based on VQ-VAE with a WaveNet decoder can be used to perform very low bit-rate speech coding with high reconstruction quality.
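The arithmetic behind such bit rates is straightforward (the numbers below are illustrative, not taken from the paper): transmitting one index from a K-entry codebook costs log2(K) bits, so the rate is the code frame rate times log2(K).

```python
import math

def vq_bitrate(codebook_size, codes_per_second):
    """Bits per second needed to transmit one codebook index per frame."""
    return codes_per_second * math.log2(codebook_size)

# Illustrative: a 256-entry codebook emitting 50 codes per second.
print(vq_bitrate(256, 50))  # 400.0 (bits per second)
```

This is why discrete VQ-VAE codes are attractive for coding: even a generously sized codebook yields rates far below conventional waveform codecs, with the WaveNet decoder reconstructing the fine detail.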

VQVAE Unsupervised Unit Discovery and Multi-scale Code2Spec Inverter for Zerospeech Challenge 2019

This proposed approach significantly improved the intelligibility (in CER), the MOS, and discrimination ABX scores compared to the official ZeroSpeech 2019 baseline or even the topline.

Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis

This paper shows that popular unsupervised training heuristics can be interpreted as variational inference in certain autoencoder models, and connects these models to VQ-VAEs, another recently proposed class of deep variational autoencoders that can be derived from a very similar mathematical argument.

Unsupervised Speech Decomposition via Triple Information Bottleneck

SpeechSplit is among the first algorithms that can separately perform style transfer on timbre, pitch, and rhythm without text labels, and can blindly decompose speech into its four components (content, timbre, pitch, and rhythm) by introducing three carefully designed information bottlenecks.

Crepe: A Convolutional Representation for Pitch Estimation

This paper proposes a data-driven pitch tracking algorithm, CREPE, which is based on a deep convolutional neural network that operates directly on the time-domain waveform, and evaluates the model's generalizability in terms of noise robustness.
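For context, pitch trackers in the CREPE family typically operate on a logarithmic cents scale rather than raw Hz; the conversion below is a standard sketch (the 10 Hz reference follows common practice and is an assumption here, not a detail taken from the paper):

```python
import math

F_REF = 10.0  # assumed reference frequency (Hz) for the cents scale

def hz_to_cents(f):
    """Frequency in Hz -> cents above the reference frequency."""
    return 1200.0 * math.log2(f / F_REF)

def cents_to_hz(c):
    """Cents above the reference -> frequency in Hz."""
    return F_REF * 2.0 ** (c / 1200.0)

# A semitone is 100 cents, so one octave (1200 cents) doubles frequency.
a4 = 440.0
print(round(hz_to_cents(a4), 1))     # 6551.3
print(cents_to_hz(hz_to_cents(a4)))  # round-trips back to 440.0
```

Predicting pitch as a distribution over cents bins and decoding with a weighted average over this scale is what lets such models report accuracy in musically meaningful units.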

Speech2Vec: A Sequence-to-Sequence Framework for Learning Word Embeddings from Speech

The proposed Speech2Vec model, a novel deep neural network architecture for learning fixed-length vector representations of audio segments excised from a speech corpus, is based on an RNN encoder-decoder framework and borrows the methodology of skip-grams or continuous bag-of-words for training.

JVS corpus: free Japanese multi-speaker voice corpus

The JVS corpus is constructed, containing voice data of 100 speakers in three styles (normal, whisper, and falsetto), totaling 30 hours of voice data including 22 hours of parallel normal voices.

WaveNet: A Generative Model for Raw Audio

WaveNet, a deep neural network for generating raw audio waveforms, is introduced; it is shown that it can be efficiently trained on data with tens of thousands of samples per second of audio, and can be employed as a discriminative model, returning promising results for phoneme recognition.