SQ-VAE: Variational Bayes on Discrete Representation with Self-annealed Stochastic Quantization

Yuhta Takida, Takashi Shibuya, Wei-Hsiang Liao, Chieh-Hsin Lai, Junki Ohmura, Toshimitsu Uesaka, Naoki Murata, Shusuke Takahashi, Toshiyuki Kumakura, Yuki Mitsufuji
One noted issue of vector-quantized variational autoencoder (VQ-VAE) is that the learned discrete representation uses only a fraction of the full capacity of the codebook, also known as codebook collapse. We hypothesize that the training scheme of VQ-VAE, which involves some carefully designed heuristics, underlies this issue. In this paper, we propose a new training scheme that extends the standard VAE via novel stochastic dequantization and quantization, called stochastically quantized… 
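The abstract describes replacing VQ-VAE's deterministic codebook lookup with a stochastic quantization step. A minimal NumPy sketch of that idea is below; it is illustrative only, not the authors' implementation. The function name `stochastic_quantize` and the specific Gaussian-distance form of the categorical posterior are assumptions for the example.

```python
import numpy as np

def stochastic_quantize(z, codebook, sigma, rng):
    """Sample a code index from a categorical posterior whose logits are the
    negative scaled squared distances to each codebook entry. As sigma -> 0
    this collapses to deterministic nearest-neighbour assignment (the usual
    VQ-VAE quantizer), which is the annealing intuition in the abstract."""
    d2 = np.sum((codebook - z) ** 2, axis=1)   # squared distance to each code
    logits = -d2 / (2.0 * sigma ** 2)
    probs = np.exp(logits - logits.max())      # numerically stable softmax
    probs /= probs.sum()
    k = rng.choice(len(codebook), p=probs)
    return k, codebook[k]

rng = np.random.default_rng(0)
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]])
z = np.array([0.9, 1.1])
k, z_q = stochastic_quantize(z, codebook, sigma=0.1, rng=rng)
# with a small sigma, the sample is almost surely the nearest code (index 1)
```

With a large sigma the posterior is nearly uniform, so all codes receive gradient signal; shrinking sigma during training recovers hard quantization.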
1 Citation

A Versatile Diffusion-based Generative Refiner for Speech Enhancement

A DNN-based generative refiner is proposed to improve the perceptual quality of speech pre-processed by a speech enhancement (SE) method; it can serve as a versatile post-processing module for a range of SE methods and has high potential in terms of modularity.

References

Vector Quantization-Based Regularization for Autoencoders

This paper introduces a quantization-based regularizer in the bottleneck stage of autoencoder models to learn meaningful latent representations and shows that the proposed regularization method results in improved latent representations for both supervised learning and clustering downstream tasks when compared to autoencoders using other bottleneck structures.

From Variational to Deterministic Autoencoders

It is shown, in a rigorous empirical study, that the proposed regularized deterministic autoencoders are able to generate samples that are comparable to, or better than, those of VAEs and more powerful alternatives when applied to images as well as to structured data such as molecules.

Neural Discrete Representation Learning

Pairing these representations with an autoregressive prior, the model can generate high-quality images, videos, and speech, as well as performing high-quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.
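The VQ-VAE quantizer referenced throughout these summaries replaces each encoder output with its nearest codebook entry, and the "codebook collapse" issue from the abstract is commonly measured via code-usage perplexity. A small NumPy sketch under these assumptions (function names are illustrative, not from any of the cited papers):

```python
import numpy as np

def quantize(z_batch, codebook):
    """Deterministic VQ-VAE assignment: each latent vector is replaced by
    its nearest codebook entry under Euclidean distance."""
    # (N, 1, D) - (1, K, D) -> (N, K, D) pairwise differences
    d2 = np.sum((z_batch[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)
    idx = d2.argmin(axis=1)                 # (N,) nearest-code indices
    return idx, codebook[idx]

def codebook_perplexity(idx, K):
    """exp(entropy) of the empirical code usage: K means the full codebook
    is used uniformly; values near 1 indicate codebook collapse."""
    counts = np.bincount(idx, minlength=K).astype(float)
    p = counts / counts.sum()
    nz = p[p > 0]
    return float(np.exp(-(nz * np.log(nz)).sum()))

codebook = np.array([[0.0], [1.0], [2.0], [3.0]])   # K=4 one-dimensional codes
z_batch = np.array([[0.1], [0.9], [1.1], [0.05]])
idx, z_q = quantize(z_batch, codebook)
# only codes 0 and 1 are ever selected -> perplexity 2.0 out of a possible 4.0
```

In training, a perplexity far below K signals that most codes are never selected, which is exactly the collapse the SQ-VAE abstract attributes to VQ-VAE's heuristic training scheme.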

Theory and Experiments on Vector Quantized Autoencoders

This work investigates an alternate training technique for VQ-VAE, inspired by its connection to the Expectation Maximization (EM) algorithm, and develops a non-autoregressive machine translation model whose accuracy almost matches a strong greedy autoregressive baseline Transformer, while being 3.3 times faster at inference.

Deterministic Decoding for Discrete Data in Variational Autoencoders

This paper studies a VAE model with a deterministic decoder (DD-VAE) for sequential data that selects the highest-scoring tokens instead of sampling, and proposes a new class of bounded support proposal distributions and derives Kullback-Leibler divergence for Gaussian and uniform priors.

Generating Diverse High-Fidelity Images with VQ-VAE-2

It is demonstrated that a multi-scale hierarchical organization of VQ-VAE, augmented with powerful priors over the latent codes, is able to generate samples with quality that rivals that of state-of-the-art Generative Adversarial Networks (GANs) on multifaceted datasets such as ImageNet, while not suffering from GANs' known shortcomings such as mode collapse and lack of diversity.

Hyperspherical Variational Auto-Encoders

This work proposes using a von Mises-Fisher distribution instead of a Gaussian distribution for both the prior and posterior of the Variational Auto-Encoder, leading to a hyperspherical latent space.

InfoVAE: Balancing Learning and Inference in Variational Autoencoders

It is shown that the proposed InfoVAE model can significantly improve the quality of the variational posterior and can make effective use of the latent features regardless of the flexibility of the decoding distribution.

Vector-quantized neural networks for acoustic unit discovery in the ZeroSpeech 2020 challenge

Two neural models are proposed to tackle the challenge of discrete representations of speech that separate phonetic content from speaker-specific details, using vector quantization to map continuous features to a finite set of codes.

Unsupervised Speech Representation Learning Using WaveNet Autoencoders

A regularization scheme is introduced that forces the representations to focus on the phonetic content of the utterance, and the resulting models report performance comparable with the top entries in the ZeroSpeech 2017 unsupervised acoustic unit discovery task.