Corpus ID: 13756489

Attention is All you Need

@inproceedings{vaswani2017attention,
  title={Attention is All you Need},
  author={Ashish Vaswani and Noam M. Shazeer and Niki Parmar and Jakob Uszkoreit and Llion Jones and Aidan N. Gomez and Lukasz Kaiser and Illia Polosukhin},
}
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. […] Key Result: We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.
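A minimal, dependency-free sketch of the scaled dot-product attention at the heart of the Transformer (plain-Python lists stand in for tensors; the helper names are illustrative, not from the paper's code):

```python
import math

def softmax(xs):
    # numerically stable softmax over a list of scores
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of vectors (lists of floats); d_k is the key dimension.
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

In the full model this runs per head on learned linear projections of the inputs; the sketch shows only the core weighting step.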

Weighted Transformer Network for Machine Translation

The Weighted Transformer is proposed: a Transformer with modified attention layers that not only outperforms the baseline network in BLEU score but also converges 15-40% faster.

How Much Attention Do You Need? A Granular Analysis of Neural Machine Translation Architectures

This work takes a fine-grained look at the different architectures for NMT and introduces an Architecture Definition Language (ADL) allowing for a flexible combination of common building blocks and shows that self-attention is much more important on the encoder side than on the decoder side.

A Simple but Effective Way to Improve the Performance of RNN-Based Encoder in Neural Machine Translation Task

A new architecture is proposed that proficiently mines the capabilities of the attention mechanism and stacked recurrent neural networks for neural machine translation (NMT) tasks.

Joint Source-Target Self Attention with Locality Constraints

This paper's simplified architecture consists of the decoder part of a Transformer model, based on self-attention but with locality constraints applied to the attention receptive field; it achieves a new state of the art of 35.7 BLEU on IWSLT'14 German-English.
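A locality constraint of this kind can be sketched as a window mask applied to the attention scores before the softmax (a toy helper, not the authors' code):

```python
import math

def local_attention_weights(scores, center, window):
    """Softmax over attention scores, with positions outside
    [center - window, center + window] masked to -inf (weight 0)."""
    masked = [s if abs(i - center) <= window else float("-inf")
              for i, s in enumerate(scores)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]  # math.exp(-inf) == 0.0
    z = sum(exps)
    return [e / z for e in exps]
```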

Attention Transformer Model for Translation of Similar Languages

This paper describes an approach to the shared task on similar-language translation at the Fifth Conference on Machine Translation (WMT20): a recurrence-based layered encoder-decoder model combined with the Transformer, which enjoys the benefits of both recurrent attention and the Transformer.

Accelerating Neural Transformer via an Average Attention Network

The proposed average attention network is applied on the decoder side of the neural Transformer to replace the original target-side self-attention model, and enables the neural Transformer to decode sentences over four times faster than its original version with almost no loss in training time and translation performance.
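In its simplest form, the average-attention idea replaces the weighted sum of decoder self-attention with a cumulative mean over the positions seen so far, which needs no attention weights and admits constant-time incremental decoding. A sketch that ignores the paper's gating and feed-forward layers:

```python
def average_attention(inputs):
    """output[t] = mean(inputs[0..t]), computed with a running sum.

    inputs is a list of vectors (lists of floats), one per position.
    """
    out, running = [], [0.0] * len(inputs[0])
    for t, x in enumerate(inputs, start=1):
        running = [r + xi for r, xi in zip(running, x)]
        out.append([r / t for r in running])
    return out
```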

Temporal Convolutional Attention-based Network For Sequence Modeling

This work proposes an exploratory architecture referred to as the Temporal Convolutional Attention-based Network (TCAN), which combines a temporal convolutional network with an attention mechanism and improves the state-of-the-art bpc/perplexity results.


This work argues that certain dependencies among words could be learned better through an intermediate context than directly modeling word-word dependencies, and proposes a new way of learning dependencies through a context in multi-head using convolution.

Self-Attention and Dynamic Convolution Hybrid Model for Neural Machine Translation

A hybrid model is proposed that combines a self-attention module and a dynamic convolution module by taking a weighted sum of their outputs where the weights can be dynamically learned by the model during training.
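The weighted sum can be sketched as a softmax over two learned scalars (variable names here are illustrative, not taken from the paper):

```python
import math

def hybrid_output(attn_out, conv_out, w_attn, w_conv):
    """Combine the two branch outputs with softmax-normalized scalar
    weights; w_attn and w_conv would be trained model parameters."""
    m = max(w_attn, w_conv)
    ea, ec = math.exp(w_attn - m), math.exp(w_conv - m)
    a = ea / (ea + ec)
    return [a * x + (1 - a) * y for x, y in zip(attn_out, conv_out)]
```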

An Analysis of Encoder Representations in Transformer-Based Machine Translation

This work investigates the information that is learned by the attention mechanism in Transformer models with different translation quality, and sheds light on the relative strengths and weaknesses of the various encoder representations.

Deep Recurrent Models with Fast-Forward Connections for Neural Machine Translation

This work introduces fast-forward connections, a new type of linear connection based on deep Long Short-Term Memory (LSTM) networks, together with an interleaved bi-directional architecture for stacking the LSTM layers; it achieves state-of-the-art performance and outperforms the best conventional model by 0.7 BLEU points.

Sequence to Sequence Learning with Neural Networks

This paper presents a general end-to-end approach to sequence learning that makes minimal assumptions about sequence structure, and finds that reversing the order of the words in all source sentences improved the LSTM's performance markedly, because doing so introduced many short-term dependencies between the source and the target sentence, which made the optimization problem easier.
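The reversal trick is pure preprocessing; assuming token lists, it amounts to:

```python
def reverse_source(pair):
    """Reverse the source token order; the target is left unchanged.

    ('a b c' -> 'x y z') becomes ('c b a' -> 'x y z'), shortening the
    distance between early source words and early target words.
    """
    src, tgt = pair
    return src[::-1], tgt
```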

Neural Machine Translation in Linear Time

The ByteNet decoder attains state-of-the-art performance on character-level language modelling, outperforming the previous best results obtained with recurrent networks; the latent alignment structure contained in the representations reflects the expected alignment between the tokens.

A Deep Reinforced Model for Abstractive Summarization

A neural network model with a novel intra-attention that attends over the input and the continuously generated output separately, and a new training method that combines standard supervised word prediction with reinforcement learning (RL), producing higher-quality summaries.

Can Active Memory Replace Attention?

An extended model of active memory is proposed that matches existing attention models on neural machine translation and generalizes better to longer sentences and discusses when active memory brings most benefits and where attention can be a better choice.

End-To-End Memory Networks

A neural network with a recurrent attention model over a possibly large external memory that is trained end-to-end, and hence requires significantly less supervision during training, making it more generally applicable in realistic settings.

Structured Attention Networks

This work shows that structured attention networks are simple extensions of the basic attention procedure, and that they allow for extending attention beyond the standard soft-selection approach, such as attending to partial segmentations or to subtrees.

Multi-task Sequence to Sequence Learning

The results show that training on a small amount of parsing and image caption data can improve the translation quality between English and German by up to 1.5 BLEU points over strong single-task baselines on the WMT benchmarks, and reveal interesting properties of the two unsupervised learning objectives, autoencoder and skip-thought, in the MTL context.

Convolutional Sequence to Sequence Learning

This work introduces an architecture based entirely on convolutional neural networks, which outperforms the accuracy of the deep LSTM setup of Wu et al. (2016) on both WMT'14 English-German and WMT'14 English-French translation at an order of magnitude faster speed, both on GPU and CPU.

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer

This work introduces a Sparsely-Gated Mixture-of-Experts layer (MoE), consisting of up to thousands of feed-forward sub-networks, and applies the MoE to the tasks of language modeling and machine translation, where model capacity is critical for absorbing the vast quantities of knowledge available in the training corpora.
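The sparse gating can be sketched as top-k selection over expert logits, so only k experts run per input (a toy sketch; the paper additionally adds noise to the gate and load-balancing terms):

```python
import math

def sparse_gate(gate_logits, k):
    """Softmax over the top-k logits; all other experts get weight 0."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    m = max(gate_logits[i] for i in top)
    exps = {i: math.exp(gate_logits[i] - m) for i in top}
    z = sum(exps.values())
    return [exps.get(i, 0.0) / z for i in range(len(gate_logits))]

def moe_layer(x, experts, gate_logits, k=2):
    # only experts with nonzero gate weight need to be evaluated
    weights = sparse_gate(gate_logits, k)
    return sum(w * expert(x) for w, expert in zip(weights, experts) if w > 0.0)
```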