# Sequence to Sequence Learning with Neural Networks

@inproceedings{Sutskever2014SequenceTS,
  title     = {Sequence to Sequence Learning with Neural Networks},
  author    = {Ilya Sutskever and Oriol Vinyals and Quoc V. Le},
  booktitle = {NIPS},
  year      = {2014}
}

Deep Neural Networks (DNNs) are powerful models that have achieved excellent performance on difficult learning tasks. [...] Our method uses a multilayered Long Short-Term Memory (LSTM) to map the input sequence to a vector of a fixed dimensionality, and then another deep LSTM to decode the target sequence from the vector. Our main result is that on an English to French translation task from the WMT-14 dataset, the translations produced by the LSTM achieve a BLEU score of 34.8 on the entire test set…
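The encoder–decoder recipe in the abstract can be sketched in heavily simplified form, with a vanilla RNN standing in for the paper's deep LSTMs. Everything here is a toy assumption: tiny hidden and vocabulary sizes, untrained random weights, and hypothetical names — it only shows the data flow (source sequence → fixed vector → generated target sequence):

```python
import math
import random

random.seed(0)
H, V = 8, 5  # hidden size and vocabulary size (toy values)

def rand_mat(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

# separate encoder and decoder weights, mirroring the paper's two-LSTM setup
Wxh_e, Whh_e = rand_mat(H, V), rand_mat(H, H)
Wxh_d, Whh_d = rand_mat(H, V), rand_mat(H, H)
Why = rand_mat(V, H)  # decoder output projection

def rnn_step(h, x, Wxh, Whh):
    # one vanilla-RNN step: h' = tanh(Wxh @ x + Whh @ h)
    return [math.tanh(sum(Wxh[i][j] * x[j] for j in range(V)) +
                      sum(Whh[i][j] * h[j] for j in range(H)))
            for i in range(H)]

def one_hot(token):
    return [1.0 if i == token else 0.0 for i in range(V)]

def encode(tokens):
    # read the source (reversed, a trick the paper reports helps) into one fixed vector
    h = [0.0] * H
    for t in reversed(tokens):
        h = rnn_step(h, one_hot(t), Wxh_e, Whh_e)
    return h

def decode(h, max_len=4):
    # greedily emit target tokens conditioned only on the fixed vector
    out = []
    for _ in range(max_len):
        logits = [sum(Why[i][j] * h[j] for j in range(H)) for i in range(V)]
        token = max(range(V), key=lambda i: logits[i])
        out.append(token)
        h = rnn_step(h, one_hot(token), Wxh_d, Whh_d)
    return out

v = encode([1, 2, 3])
print(len(v))       # the whole source is compressed to a fixed dimensionality (8 here)
print(decode(v))    # target tokens decoded from that single vector
```

The key property several of the papers below react to is visible here: the decoder sees the source only through the fixed-size vector `v`, whatever the source length.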

#### 13,763 Citations

A Hierarchical Neural Network for Sequence-to-Sequences Learning

- Computer Science
- ArXiv
- 2018

This paper presents a hierarchical deep neural network architecture that improves the quality of long-sentence translation, achieving superior results with higher BLEU (Bilingual Evaluation Understudy) scores, lower perplexity, and better imitation of expression style and word usage than traditional networks.

Multi-task Sequence to Sequence Learning

- Computer Science, Mathematics
- ICLR
- 2016

The results show that training on a small amount of parsing and image caption data can improve the translation quality between English and German by up to 1.5 BLEU points over strong single-task baselines on the WMT benchmarks, and reveal interesting properties of the two unsupervised learning objectives, autoencoder and skip-thought, in the MTL context.

Unsupervised Pretraining for Sequence to Sequence Learning

- Computer Science
- EMNLP
- 2017

This work presents a general unsupervised learning method that improves the accuracy of sequence to sequence (seq2seq) models by initializing the encoder and decoder with the weights of two pretrained language models and then fine-tuning with labeled data.

Von Mises-Fisher Loss for Training Sequence to Sequence Models with Continuous Outputs

- Computer Science, Mathematics
- ICLR
- 2019

This work proposes a general technique for replacing the softmax layer with a continuous embedding layer, introducing a novel probabilistic loss together with a training and inference procedure in which the model generates a probability distribution over pre-trained word embeddings instead of a multinomial distribution over the vocabulary obtained via softmax.
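The inference side of this idea can be sketched as follows. This is not the paper's von Mises-Fisher loss itself, only the continuous-output decoding step it enables: the decoder emits a vector, and the output word is whichever pre-trained embedding lies closest (cosine similarity here; the toy vocabulary and 2-d embeddings are hypothetical):

```python
import math

# hypothetical pre-trained word embeddings (real systems use e.g. 300-d vectors)
emb = {"cat": [1.0, 0.0], "dog": [0.9, 0.1], "sky": [0.0, 1.0]}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def nearest_word(predicted):
    # inference step: instead of a softmax over the whole vocabulary, the decoder
    # emits a continuous vector and we output the closest pre-trained embedding
    return max(emb, key=lambda w: cosine(emb[w], predicted))

print(nearest_word([0.9, 0.1]))  # → dog
print(nearest_word([0.1, 0.9]))  # → sky
```

The appeal is that the output layer's cost no longer scales with vocabulary size, since no normalization over all words is computed.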

Sequence-to-Sequence Learning with Latent Neural Grammars

- Computer Science
- ArXiv
- 2021

This work develops a neural parameterization of the grammar which enables parameter sharing over the combinatorial space of derivation rules without the need for manual feature engineering, and applies it to a diagnostic language navigation task and to small-scale machine translation.

Agreement on Target-Bidirectional LSTMs for Sequence-to-Sequence Learning

- Computer Science
- AAAI
- 2016

The proposed approach relies on the agreement between a pair of target-directional LSTMs, which generates more balanced targets, and develops two efficient approximate search methods for agreement that are empirically shown to be almost optimal in terms of sequence-level losses.

The Roles of Language Models and Hierarchical Models in Neural Sequence-to-Sequence Prediction

- Computer Science
- EAMT
- 2020

It is shown how traditional symbolic statistical machine translation models can still improve neural machine translation while reducing the risk of common pathologies of NMT such as hallucinations and neologisms.

Data Generation Using Sequence-to-Sequence

- Computer Science
- 2018 IEEE Recent Advances in Intelligent Computational Systems (RAICS)
- 2018

The results obtained support the hypothesis that the process of generating clean data can be validated objectively by evaluating the models alongside the system's efficiency in generating data at each iteration.

Recurrent Neural Network-Based Semantic Variational Autoencoder for Sequence-to-Sequence Learning

- Computer Science, Mathematics
- Inf. Sci.
- 2019

Experimental results on three natural language tasks confirm that the proposed RNN-SVAE, which learns the mean and standard deviation of a continuous semantic space to take advantage of the variational method, yields higher performance than two benchmark models.

Semi-supervised Sequence Learning

- Computer Science
- NIPS
- 2015

Two approaches to using unlabeled data to improve sequence learning with recurrent networks are presented, and it is found that long short-term memory recurrent networks, after being pretrained with the two approaches, become more stable to train and generalize better.

#### References

Showing 1–10 of 56 references.

Sequence Transduction with Recurrent Neural Networks

- Computer Science, Mathematics
- ArXiv
- 2012

This paper introduces an end-to-end, probabilistic sequence transduction system, based entirely on RNNs, that is in principle able to transform any input sequence into any finite, discrete output sequence.

LSTM Neural Networks for Language Modeling

- Computer Science
- INTERSPEECH
- 2012

This work analyzes the Long Short-Term Memory neural network architecture on an English and a large French language modeling task and gains considerable improvements in WER on top of a state-of-the-art speech recognition system.

Neural Machine Translation by Jointly Learning to Align and Translate

- Computer Science, Mathematics
- ICLR
- 2015

It is conjectured that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder-decoder architecture, and it is proposed to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
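The (soft-)search this abstract describes is attention: at each decoding step, score every source-side encoder state against the decoder's current state, normalize the scores with a softmax, and mix the states into a context vector. A minimal sketch with made-up 2-d states and a plain dot-product score (the paper learns the scoring function with a small network):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, enc_states):
    # score each source-position state against the decoder query,
    # normalize into attention weights, and mix a context vector
    scores = [sum(q * k for q, k in zip(query, h)) for h in enc_states]
    weights = softmax(scores)
    dim = len(enc_states[0])
    context = [sum(w * h[d] for w, h in zip(weights, enc_states)) for d in range(dim)]
    return weights, context

enc_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy annotations for 3 source words
weights, context = attend([1.0, 0.0], enc_states)
print([round(w, 2) for w in weights])  # attention mass over the source positions
```

Because the decoder draws on all source states rather than one fixed vector, the fixed-length bottleneck criticized above disappears.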

A Neural Probabilistic Language Model

- Computer Science
- J. Mach. Learn. Res.
- 2000

This work proposes to fight the curse of dimensionality by learning a distributed representation for words which allows each training sentence to inform the model about an exponential number of semantically neighboring sentences.

Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks

- Computer Science
- ICML
- 2006

This paper presents a novel method for training RNNs to label unsegmented sequences directly, thereby solving both problems of sequence learning and post-processing.

Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition

- Computer Science
- IEEE Transactions on Audio, Speech, and Language Processing
- 2012

This work proposes a pre-trained deep neural network hidden Markov model (DNN-HMM) hybrid architecture that trains the DNN to produce a distribution over senones (tied triphone states) as its output, and shows that it can significantly outperform conventional context-dependent Gaussian mixture model (GMM) HMMs.

Joint Language and Translation Modeling with Recurrent Neural Networks

- Computer Science
- EMNLP
- 2013

This work presents a joint language and translation model based on a recurrent neural network, which predicts target words based on an unbounded history of both source and target words and shows competitive accuracy compared to traditional channel model features.

Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation

- Computer Science, Mathematics
- EMNLP
- 2014

Qualitatively, the proposed RNN Encoder–Decoder model learns a semantically and syntactically meaningful representation of linguistic phrases.

Learning long-term dependencies with gradient descent is difficult

- Computer Science, Medicine
- IEEE Trans. Neural Networks
- 1994

This work shows why gradient based learning algorithms face an increasingly difficult problem as the duration of the dependencies to be captured increases, and exposes a trade-off between efficient learning by gradient descent and latching on information for long periods.

Statistical Language Models Based on Neural Networks

- Computer Science
- 2012

Although these models are computationally more expensive than N-gram models, the presented techniques make it possible to apply them efficiently to state-of-the-art systems, achieving the best published performance on the well-known Penn Treebank setup.