Publications
Attention is All you Need
TLDR
A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as demonstrated by applying it successfully to English constituency parsing with both large and limited training data.
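As a quick illustration of the mechanism this TLDR refers to, below is a minimal NumPy sketch of scaled dot-product attention, the core operation of the Transformer; the shapes, function name, and plain-NumPy setting are illustrative assumptions, not the paper's code.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Q, K: (seq_len, d_k); V: (seq_len, d_v). Illustrative single-head case.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                     # scaled pairwise similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over the keys
    return weights @ V                                  # attention-weighted sum of values

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8)); K = rng.normal(size=(4, 8)); V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)             # shape (4, 8)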
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems
TLDR
The TensorFlow interface, and an implementation of that interface built at Google, are described; the system has been used for conducting research and for deploying machine learning systems into production across more than a dozen areas of computer science and other fields.
Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation
TLDR
GNMT, Google's Neural Machine Translation system, is presented; it attempts to address many of the weaknesses of conventional phrase-based translation systems and provides a good balance between the flexibility of "character"-delimited models and the efficiency of "word"-delimited models.
Grammar as a Foreign Language
TLDR
The domain-agnostic attention-enhanced sequence-to-sequence model achieves state-of-the-art results on the most widely used syntactic constituency parsing dataset when trained on a large synthetic corpus that was annotated using existing parsers.
Regularizing Neural Networks by Penalizing Confident Output Distributions
TLDR
It is found that both label smoothing and the confidence penalty improve state-of-the-art models across benchmarks without modifying existing hyperparameters, suggesting the wide applicability of these regularizers.
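To make the two regularizers concrete, here is a small NumPy sketch contrasting them: label smoothing mixes the one-hot target with a uniform distribution, while the confidence penalty subtracts a multiple of the output entropy from the negative log-likelihood. The eps and beta values are illustrative assumptions, not taken from the paper.

import numpy as np

def softmax(logits):
    z = np.exp(logits - logits.max())
    return z / z.sum()

def label_smoothing_loss(logits, target, eps=0.1):
    p = softmax(logits)
    smoothed = np.full(len(logits), eps / len(logits))  # uniform mass eps
    smoothed[target] += 1.0 - eps                       # remaining mass on the true class
    return -(smoothed * np.log(p)).sum()                # cross-entropy vs. smoothed target

def confidence_penalty_loss(logits, target, beta=0.1):
    p = softmax(logits)
    nll = -np.log(p[target])                            # ordinary cross-entropy
    entropy = -(p * np.log(p)).sum()                    # entropy of the output distribution
    return nll - beta * entropy                         # penalize overconfident (low-entropy) outputs

logits = np.array([2.0, 0.5, -1.0])
print(label_smoothing_loss(logits, target=0), confidence_penalty_loss(logits, target=0))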
Multi-task Sequence to Sequence Learning
TLDR
The results show that training on a small amount of parsing and image caption data can improve the translation quality between English and German by up to 1.5 BLEU points over strong single-task baselines on the WMT benchmarks, and reveal interesting properties of the two unsupervised learning objectives, autoencoder and skip-thought, in the MTL context.
Generating Wikipedia by Summarizing Long Sequences
TLDR
It is shown that generating English Wikipedia articles can be approached as a multi-document summarization of source documents, and a neural abstractive model is introduced which can generate fluent, coherent multi-sentence paragraphs and even whole Wikipedia articles.
Reformer: The Efficient Transformer
TLDR
This work replaces dot-product attention with one that uses locality-sensitive hashing and uses reversible residual layers instead of the standard residuals, which allows storing activations only once during training instead of several times, making the model much more memory-efficient and much faster on long sequences.
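A rough sketch of the locality-sensitive hashing step mentioned above: an angular (random-projection) hash assigns similar query/key vectors to the same bucket, so attention only needs to be computed among positions that share a bucket. The bucket count and single hash round here are assumptions for illustration, not the paper's configuration.

import numpy as np

def lsh_bucket_ids(x, n_buckets, rng):
    # x: (seq_len, d) shared query/key vectors; returns one bucket id per position.
    d = x.shape[-1]
    R = rng.normal(size=(d, n_buckets // 2))            # random projection
    proj = x @ R
    return np.argmax(np.concatenate([proj, -proj], axis=-1), axis=-1)

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 32))                           # 16 positions, 32-dim vectors
buckets = lsh_bucket_ids(x, n_buckets=8, rng=rng)
# Attention is then restricted to positions with equal bucket ids, replacing the
# quadratic all-pairs comparison on long sequences.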
Sentence Compression by Deletion with LSTMs
TLDR
It is demonstrated that even the most basic version of the LSTM system, given no syntactic information or desired compression length, performs surprisingly well: around 30% of the compressions from a large test set could be regenerated.
Learning to Remember Rare Events
TLDR
A large-scale life-long memory module for use in deep learning is presented that remembers training examples shown many thousands of steps in the past and can successfully generalize from them, demonstrating, for the first time, life-long one-shot learning in recurrent neural networks on a large-scale machine translation task.
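As a simplified illustration of the kind of key-value memory the TLDR describes, the sketch below matches a query against stored keys by cosine similarity and returns the value of the nearest key; the slot-replacement policy and the training loss from the paper are omitted, and all names here are hypothetical.

import numpy as np

class SimpleMemory:
    def __init__(self, size, key_dim, rng):
        self.keys = rng.normal(size=(size, key_dim))
        self.keys /= np.linalg.norm(self.keys, axis=1, keepdims=True)  # unit-norm keys
        self.values = np.zeros(size, dtype=int)                        # e.g. class labels

    def query(self, q):
        q = q / np.linalg.norm(q)
        sims = self.keys @ q                    # cosine similarity to every stored key
        idx = int(np.argmax(sims))
        return self.values[idx], float(sims[idx])

    def write(self, q, v):
        # Placeholder policy: overwrite the worst-matching slot with the new pair.
        q = q / np.linalg.norm(q)
        idx = int(np.argmin(self.keys @ q))
        self.keys[idx] = q
        self.values[idx] = v

rng = np.random.default_rng(0)
mem = SimpleMemory(size=128, key_dim=16, rng=rng)
mem.write(rng.normal(size=16), v=3)             # remember a rare example after a single exposure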