# Convolutional Sequence to Sequence Learning

@inproceedings{Gehring2017ConvolutionalST, title={Convolutional Sequence to Sequence Learning}, author={Jonas Gehring and Michael Auli and David Grangier and Denis Yarats and Yann Dauphin}, booktitle={International Conference on Machine Learning}, year={2017} }

The prevalent approach to sequence to sequence learning maps an input sequence to a variable length output sequence via recurrent neural networks. [...] Our use of gated linear units eases gradient propagation, and we equip each decoder layer with a separate attention module. We outperform the accuracy of the deep LSTM setup of Wu et al. (2016) on both WMT'14 English-German and WMT'14 English-French translation at an order of magnitude faster speed, both on GPU and CPU.
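
The gated linear unit mentioned above can be illustrated with a minimal sketch. It assumes the formulation from the gated convolutional network paper listed in the references, where a layer output of size 2d is split into halves A and B and the unit returns A ⊗ σ(B). The names `sigmoid` and `glu` and the flat-list input are hypothetical choices for illustration, not the authors' implementation.

```python
import math


def sigmoid(x):
    """Logistic function, used here as the gate activation."""
    return 1.0 / (1.0 + math.exp(-x))


def glu(values):
    """Gated linear unit over a flat list of even length.

    The first half (A) carries the signal and the second half (B) is
    passed through a sigmoid to gate it: output[i] = A[i] * sigmoid(B[i]).
    """
    if len(values) % 2 != 0:
        raise ValueError("GLU input must have an even number of elements")
    half = len(values) // 2
    a, b = values[:half], values[half:]
    return [a_i * sigmoid(b_i) for a_i, b_i in zip(a, b)]


# A gate input of 0 gives sigmoid(0) = 0.5, so each signal value is halved.
print(glu([2.0, -1.0, 0.0, 0.0]))
```

Because the signal path A is linear, the gradient through it is scaled by the gate but not squashed by a saturating nonlinearity, which is one common explanation for why such units ease gradient propagation.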

## 2,714 Citations

### Pervasive Attention: 2D Convolutional Neural Networks for Sequence-to-Sequence Prediction

- Computer Science
- CoNLL
- 2018

This work proposes an alternative approach which instead relies on a single 2D convolutional neural network across both sequences, which outperforms state-of-the-art encoder-decoder systems, while being conceptually simpler and having fewer parameters.

### DENSELY CONNECTED RECURRENT NEURAL NET-

- Computer Science
- 2017

It is shown that, on the WMT-14 English-French translation task with a 12M subset of the training data, the model needs half the training time and half the model parameters to reach BLEU similar to that of typical stacked LSTM models.

### Double Path Networks for Sequence to Sequence Learning

- Computer Science
- COLING
- 2018

This work proposes Double Path Networks for Sequence to Sequence learning (DPN-S2S), which leverage the advantages of both models by using double path information fusion and can significantly improve the performance of sequence to sequence learning over state-of-the-art systems.

### An Empirical Evaluation of Generic Convolutional and Recurrent Networks for Sequence Modeling

- Computer Science
- ArXiv
- 2018

A systematic evaluation of generic convolutional and recurrent architectures for sequence modeling concludes that the common association between sequence modeling and recurrent networks should be reconsidered, and that convolutional networks should be regarded as a natural starting point for sequence modeling tasks.

### Sequence Labeling With Deep Gated Dual Path CNN

- Computer Science
- IEEE/ACM Transactions on Audio, Speech, and Language Processing
- 2019

Experimental results on three sequence labeling tasks show that the proposed model achieves performance competitive with the RNN-based state-of-the-art method while maintaining faster speed, even with up to 10 convolutional layers.

### Training RNNs as Fast as CNNs

- Computer Science
- EMNLP 2018
- 2017

The Simple Recurrent Unit architecture is proposed, a recurrent unit that simplifies the computation and exposes more parallelism, and is as fast as a convolutional layer and 5-10x faster than an optimized LSTM implementation.

### Classical Structured Prediction Losses for Sequence to Sequence Learning

- Computer Science
- NAACL
- 2018

This work surveys a range of classical objective functions that have been widely used to train linear models for structured prediction, applies them to neural sequence to sequence models, and shows that these losses can perform surprisingly well, slightly outperforming beam search optimization in a like-for-like setup.

### Simple Recurrent Units for Highly Parallelizable Recurrence

- Computer Science
- EMNLP
- 2018

The Simple Recurrent Unit is proposed, a light recurrent unit that balances model capacity and scalability. It is designed to provide expressive recurrence and enable highly parallelized implementation, and it comes with careful initialization to facilitate training of deep models.

### Convolutional Sequence Modeling Revisited

- Computer Science
- ICLR
- 2018

It is argued that it may be time to (re)consider ConvNets as the default “go to” architecture for sequence modeling, and the potential “infinite memory” advantage that RNNs have over TCNs is largely absent in practice.

### Sequence-to-Sequence Speech Recognition with Time-Depth Separable Convolutions

- Computer Science
- INTERSPEECH
- 2019

We propose a fully convolutional sequence-to-sequence encoder architecture with a simple and efficient decoder. Our model improves WER on LibriSpeech while being an order of magnitude more efficient…

## References

SHOWING 1-10 OF 47 REFERENCES

### Sequence to Sequence Learning with Neural Networks

- Computer Science
- NIPS
- 2014

This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions on the sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short term dependencies between the source and the target sentence which made the optimization problem easier.
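
The source-reversal trick described above is a preprocessing step and can be sketched in a few lines. This is a minimal illustration, assuming sentences are already tokenized into lists of words; the function name `reverse_sources` and the toy sentence pair are hypothetical.

```python
def reverse_sources(pairs):
    """Reverse the word order of each source sentence, leaving targets unchanged.

    `pairs` is a list of (source_tokens, target_tokens) tuples.
    """
    return [(list(reversed(src)), list(tgt)) for src, tgt in pairs]


# Toy example (hypothetical data): after reversal, the first source word "a"
# sits at the end of the source, right next to where decoding of the first
# target word "x" begins.
pairs = [(["a", "b", "c"], ["x", "y", "z"])]
print(reverse_sources(pairs))
```

Only the source side is reversed, so the first few source words end up close to the first few target words, which is the kind of short term dependency the summary refers to.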

### A Convolutional Encoder Model for Neural Machine Translation

- Computer Science
- ACL
- 2017

A faster and simpler architecture based on a succession of convolutional layers is presented. It encodes the entire source sentence simultaneously, unlike recurrent networks, whose computation is constrained by temporal dependencies.

### Language Modeling with Gated Convolutional Networks

- Computer Science
- ICML
- 2017

A finite context approach through stacked convolutions is developed, which can be more efficient since the convolutions allow parallelization over sequential tokens. This is the first time a non-recurrent approach is competitive with strong recurrent models on these large scale language tasks.

### Encoding Source Language with Convolutional Neural Network for Machine Translation

- Computer Science
- ACL
- 2015

This work gives a more systematic treatment that summarizes the relevant source information through a convolutional architecture guided by the target information, and that can achieve significant improvements over the previous NNJM.

### End-To-End Memory Networks

- Computer Science
- NIPS
- 2015

This work introduces a neural network with a recurrent attention model over a possibly large external memory. The model is trained end-to-end and hence requires significantly less supervision during training, making it more generally applicable in realistic settings.

### Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation

- Computer Science
- TACL
- 2016

This work introduces a new type of linear connections, named fast-forward connections, based on deep Long Short-Term Memory (LSTM) networks, and an interleaved bi-directional architecture for stacking the LSTM layers, and achieves state-of-the-art performance and outperforms the best conventional model by 0.7 BLEU points.

### Pixel Recurrent Neural Networks

- Computer Science
- ICML
- 2016

A deep neural network is presented that sequentially predicts the pixels in an image along the two spatial dimensions and encodes the complete set of dependencies in the image to achieve log-likelihood scores on natural images that are considerably better than the previous state of the art.

### Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks

- Computer Science
- NIPS
- 2016

A reparameterization of the weight vectors in a neural network that decouples the length of those weight vectors from their direction is presented, improving the conditioning of the optimization problem and speeding up convergence of stochastic gradient descent.
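
The reparameterization described above can be written as w = g · v / ‖v‖, where the scalar g sets the length of the weight vector and v sets only its direction. The following is a minimal sketch of that formula for a single weight vector; the function name `weight_norm` and the plain-list representation are hypothetical, chosen for illustration.

```python
import math


def weight_norm(v, g):
    """Return w = g * v / ||v|| for a direction vector v and a scalar length g."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction vector v must be non-zero")
    return [g * x / norm for x in v]


# The resulting vector points along v and has Euclidean length g,
# regardless of how long v itself is.
w = weight_norm([3.0, 4.0], 10.0)
print(w)
```

During training, g and v would be the learned parameters, and rescaling v leaves w unchanged.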

### Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation

- Computer Science
- EMNLP
- 2014

Qualitatively, the proposed RNN Encoder‐Decoder model learns a semantically and syntactically meaningful representation of linguistic phrases.

### Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

- Computer Science
- ICLR
- 2017

This work introduces a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks, and applies the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora.