# Fast Decoding in Sequence Models using Discrete Latent Variables

@article{Kaiser2018FastDI, title={Fast Decoding in Sequence Models using Discrete Latent Variables}, author={Łukasz Kaiser and Aurko Roy and Ashish Vaswani and Niki Parmar and Samy Bengio and Jakob Uszkoreit and Noam M. Shazeer}, journal={ArXiv}, year={2018}, volume={abs/1803.03382} }

Autoregressive sequence models based on deep neural networks, such as RNNs, Wavenet and the Transformer, attain state-of-the-art results on many tasks. [...] Key method: we first auto-encode the target sequence into a shorter sequence of discrete latent variables, which at inference time is generated autoregressively, and finally decode the output sequence from this shorter latent sequence in parallel.
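The inference procedure described in the abstract can be illustrated with a minimal Python sketch. Everything named here is hypothetical and not taken from the paper: `latent_then_parallel_decode`, the toy stand-ins `toy_next_latent` and `toy_decode_position` (which replace the learned neural networks), and the constants `CODEBOOK_SIZE` and `COMPRESSION`. The sketch only shows the control flow: a short sequential loop over latent symbols, followed by output positions that do not depend on one another.

```python
from typing import Callable, List, Tuple


def latent_then_parallel_decode(
    source: List[int],
    target_len: int,
    compression: int,
    next_latent: Callable[[List[int], List[int]], int],
    decode_position: Callable[[List[int], List[int], int], int],
) -> Tuple[List[int], List[int]]:
    """Generate a short latent sequence step by step, then decode all outputs."""
    # Length of the latent sequence: ceil(target_len / compression).
    num_latents = -(-target_len // compression)

    # Sequential part: each latent symbol is conditioned on the latents so far.
    latents: List[int] = []
    for _ in range(num_latents):
        latents.append(next_latent(source, latents))

    # Parallelizable part: every output position reads only the source and the
    # complete latent sequence, never another output position.
    outputs = [decode_position(source, latents, i) for i in range(target_len)]
    return latents, outputs


# Toy stand-ins for the learned models (assumptions for illustration only).
CODEBOOK_SIZE = 16  # hypothetical number of discrete latent symbols
COMPRESSION = 4     # hypothetical ratio of target length to latent length


def toy_next_latent(source: List[int], prefix: List[int]) -> int:
    # Deterministic placeholder for an autoregressive latent prior.
    return (sum(source) + 7 * len(prefix)) % CODEBOOK_SIZE


def toy_decode_position(source: List[int], latents: List[int], i: int) -> int:
    # Placeholder decoder: output i depends on the latent covering position i.
    return latents[i // COMPRESSION] * 10 + i % COMPRESSION


latents, outputs = latent_then_parallel_decode(
    [3, 1, 4], 8, COMPRESSION, toy_next_latent, toy_decode_position
)
print(len(latents), "sequential steps for", len(outputs), "output tokens")
```

In this toy setting the sequential loop runs once per latent symbol rather than once per output token, which is where the speedup in such a scheme would come from; the list comprehension stands in for a single parallel decoder pass.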

## 144 Citations

Blockwise Parallel Decoding for Deep Autoregressive Models

- Computer Science, NeurIPS
- 2018

This work proposes a novel blockwise parallel decoding scheme that makes predictions for multiple time steps in parallel and then backs off to the longest prefix validated by a scoring model, which allows for substantial theoretical improvements in generation speed when applied to architectures that can process output sequences in parallel.

Fast Structured Decoding for Sequence Models

- Computer Science, NeurIPS
- 2019

This work designs an efficient approximation for Conditional Random Fields (CRF) for non-autoregressive sequence models and proposes a dynamic transition technique to model positional contexts in the CRF. It shows that, while adding little latency, this model can achieve significantly better translation performance than previous non-autoregressive models on different translation datasets.

Theory and Experiments on Vector Quantized Autoencoders

- Computer Science, ArXiv
- 2018

This work investigates an alternate training technique for VQ-VAE, inspired by its connection to the Expectation Maximization (EM) algorithm, and develops a non-autoregressive machine translation model whose accuracy almost matches a strong greedy autoregressive baseline Transformer, while being 3.3 times faster at inference.

Towards a better understanding of Vector Quantized Autoencoders

- Computer Science
- 2018

This work investigates an alternate training technique for VQ-VAE, inspired by its connection to the Expectation Maximization (EM) algorithm, and develops a non-autoregressive machine translation model whose accuracy almost matches a strong greedy autoregressive baseline Transformer, while being 3.3 times faster at inference.

Discretized Bottleneck in VAE: Posterior-Collapse-Free Sequence-to-Sequence Learning

- Computer Science, ArXiv
- 2020

This paper proposes a principled approach to eliminate the posterior-collapse issue by applying a discretized bottleneck in the latent space, and imposes a shared discrete latent space in which each input learns to choose a combination of shared latent atoms as its latent representation.

STCN: Stochastic Temporal Convolutional Networks

- Computer Science, ICLR
- 2019

A hierarchy of stochastic latent variables that captures temporal dependencies at different time-scales is proposed, which achieves state-of-the-art log-likelihoods across several tasks and is capable of predicting high-quality synthetic samples over a long-range temporal horizon in modeling of handwritten text.

Semi-Autoregressive Neural Machine Translation

- Computer Science, EMNLP
- 2018

A novel model for fast sequence generation, the semi-autoregressive Transformer (SAT), is proposed; it keeps the autoregressive property globally but relaxes it locally, and is thus able to produce multiple successive words in parallel at each time step.

Latent-Variable Non-Autoregressive Neural Machine Translation with Deterministic Inference using a Delta Posterior

- Computer Science, AAAI
- 2020

Inspired by recent refinement-based approaches, LaNMT is proposed, a latent-variable non-autoregressive model with continuous latent variables and a deterministic inference procedure that closes the performance gap between non-autoregressive and autoregressive approaches on the ASPEC Ja-En dataset with 8.6x faster decoding.

Retrieving Sequential Information for Non-Autoregressive Neural Machine Translation

- Computer Science, ACL
- 2019

Experimental results on three translation tasks show that the Reinforce-NAT surpasses the baseline NAT system by a significant margin on BLEU without decelerating the decoding speed and the FS-decoder achieves comparable translation performance to the autoregressive Transformer with considerable speedup.

Non-autoregressive Machine Translation with Disentangled Context Transformer

- Computer Science
- 2020

An attention-masking based model, called the Disentangled Context (DisCo) transformer, is proposed; it simultaneously generates all tokens given different contexts and achieves competitive, if not better, performance compared to the state of the art in non-autoregressive machine translation while significantly reducing decoding time on average.

## References

Showing 1-10 of 59 references.

Discrete Autoencoders for Sequence Models

- Computer Science, ArXiv
- 2018

This work proposes to improve the representation in sequence models by augmenting current approaches with an autoencoder that is forced to compress the sequence through an intermediate discrete latent space, and introduces an improved semantic hashing technique.

Neural Machine Translation in Linear Time

- Computer Science, ArXiv
- 2016

The ByteNet decoder attains state-of-the-art performance on character-level language modelling and outperforms the previous best results obtained with recurrent networks, and the latent alignment structure contained in the representations reflects the expected alignment between the tokens.

Sequence to Sequence Learning with Neural Networks

- Computer Science, NIPS
- 2014

This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.

Attention is All you Need

- Computer Science, NIPS
- 2017

A new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely is proposed, which generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

Convolutional Sequence to Sequence Learning

- Computer Science, ICML
- 2017

This work introduces an architecture based entirely on convolutional neural networks, which outperforms the accuracy of the deep LSTM setup of Wu et al. (2016) on both WMT'14 English-German and WMT'14 English-French translation at an order of magnitude faster speed, both on GPU and CPU.

Generating Sentences from a Continuous Space

- Computer Science, CoNLL
- 2016

This work introduces and studies an RNN-based variational autoencoder generative model that incorporates distributed latent representations of entire sentences, which allows it to explicitly model holistic properties of sentences such as style, topic, and high-level syntactic features.

Neural Discrete Representation Learning

- Computer Science, NIPS
- 2017

Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.

Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation

- Computer Science, EMNLP
- 2014

Qualitatively, the proposed RNN Encoder‐Decoder model learns a semantically and syntactically meaningful representation of linguistic phrases.

Improved Variational Autoencoders for Text Modeling using Dilated Convolutions

- Computer Science, ICML
- 2017

It is shown that with the right decoder, VAE can outperform LSTM language models, and perplexity gains are demonstrated on two datasets, representing the first positive experimental result on the use of VAE for generative modeling of text.

Compressing Word Embeddings via Deep Compositional Code Learning

- Computer Science, ICLR
- 2018

This work proposes to directly learn the discrete codes in an end-to-end neural network by applying the Gumbel-softmax trick, and achieves a compression rate of 98% in a sentiment analysis task and 94% to 99% in machine translation tasks without performance loss.