Publications
Attention Is All You Need
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks in an encoder-decoder configuration. The best performing models also connect the encoder and decoder through an attention mechanism. …
  • Citations: 11,724 • Highly influential: 2,979 • PDF
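The attention mechanism this paper builds on can be illustrated with a minimal NumPy sketch of scaled dot-product attention, softmax(QKᵀ/√d_k)V; the function name and shapes are illustrative, not from the paper's code:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single sequence."""
    d_k = Q.shape[-1]
    # Similarity of every query against every key, scaled to keep
    # the softmax in a well-conditioned range.
    scores = Q @ K.T / np.sqrt(d_k)                     # (n_q, n_k)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)      # row-wise softmax
    return weights @ V                                  # (n_q, d_v)

# Toy usage: 4 query positions attending over 6 key/value positions.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(scaled_dot_product_attention(Q, K, V).shape)      # (4, 8)
```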
Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks
Recurrent Neural Networks can be trained to produce sequences of tokens given some input, as exemplified by recent results in machine translation and image captioning. The current approach to …
  • Citations: 927 • Highly influential: 105 • PDF
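Scheduled sampling replaces some ground-truth tokens with the model's own previous predictions during training, with the replacement probability growing over time. A minimal sketch of the per-timestep sampling decision, using the inverse-sigmoid decay schedule discussed in the paper (function and token names are illustrative):

```python
import math
import random

def choose_input_token(gold_prev, model_prev, step, k=1000.0):
    """Pick the decoder input for the next timestep.

    With probability eps(step), feed the ground-truth previous token;
    otherwise feed the model's own previous prediction. eps follows the
    inverse-sigmoid decay: eps = k / (k + exp(step / k)).
    """
    eps = k / (k + math.exp(step / k))   # starts near 1, decays toward 0
    return gold_prev if random.random() < eps else model_prev

# Early in training we almost always feed the gold token ...
print(choose_input_token("the", "<model-token>", step=10))
# ... late in training, mostly the model's own prediction.
print(choose_input_token("the", "<model-token>", step=20000))
```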
Exploring the Limits of Language Modeling
In this work we explore recent advances in Recurrent Neural Networks for large scale Language Modeling, a task central to language understanding. We extend current models to deal with two key …
  • Citations: 740 • Highly influential: 88 • PDF
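Language models in this line of work are compared by perplexity, the exponentiated average negative log-likelihood of the test tokens. A minimal sketch of that computation (the probabilities below are made up for illustration):

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood) over the tokens.

    token_probs holds the probability the model assigned to each
    actual next token in the evaluation text.
    """
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model assigning these probabilities to four test tokens:
print(perplexity([0.2, 0.1, 0.5, 0.05]))  # ~ 6.7
```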
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
Transfer learning, where a model is first pre-trained on a data-rich task before being fine-tuned on a downstream task, has emerged as a powerful technique in natural language processing (NLP). The …
  • Citations: 489 • Highly influential: 80 • PDF
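The unifying idea is to cast every NLP task as mapping an input string to an output string, typically by prefixing the input with a task description. A minimal sketch of that framing (the prefixes mirror those used in the paper; the function itself is illustrative):

```python
def to_text_to_text(task, text, extra=None):
    """Cast a task instance into the (input string -> output string) format."""
    if task == "translate_en_de":
        return f"translate English to German: {text}"
    if task == "summarize":
        return f"summarize: {text}"
    if task == "stsb":  # semantic similarity: regression cast as text
        return f"stsb sentence1: {text} sentence2: {extra}"
    raise ValueError(f"unknown task: {task}")

print(to_text_to_text("translate_en_de", "That is good."))
# -> "translate English to German: That is good."
```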
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer
The capacity of a neural network to absorb information is limited by its number of parameters. Conditional computation, where parts of the network are active on a per-example basis, has been proposed …
  • Citations: 493 • Highly influential: 38 • PDF
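The core of the layer is a gating network that routes each example to only the top-k of many expert sub-networks, so compute stays roughly constant while parameter count grows. A minimal NumPy sketch of top-k gating (the paper's noise term and load-balancing loss are omitted; all names are illustrative):

```python
import numpy as np

def moe_layer(x, gate_w, experts, k=2):
    """Route input x to the top-k experts chosen by a softmax gate.

    x: (d,) input vector; gate_w: (d, n_experts) gating weights;
    experts: list of n_experts callables, each mapping (d,) -> (d,).
    """
    logits = x @ gate_w
    top = np.argsort(logits)[-k:]                 # indices of the top-k experts
    g = np.exp(logits[top] - logits[top].max())
    g /= g.sum()                                  # renormalized gate values
    # Only the selected experts are evaluated: conditional computation.
    return sum(gi * experts[i](x) for gi, i in zip(g, top))

rng = np.random.default_rng(0)
d, n = 8, 4
experts = [lambda v, W=rng.normal(size=(d, d)): np.tanh(v @ W) for _ in range(n)]
out = moe_layer(rng.normal(size=d), rng.normal(size=(d, n)), experts, k=2)
print(out.shape)  # (8,)
```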
Generating Wikipedia by Summarizing Long Sequences
We show that generating English Wikipedia articles can be approached as a multi-document summarization of source documents. We use extractive summarization to coarsely identify salient information …
  • Citations: 235 • Highly influential: 36 • PDF
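The first, extractive stage can be approximated by scoring source passages against the article topic and keeping only the best ones before an abstractive model runs. A toy sketch using word-overlap scoring (the paper evaluates stronger extractors such as tf-idf; all names here are illustrative):

```python
def extract_salient(paragraphs, topic, budget=2):
    """Rank paragraphs by word overlap with the topic; keep the top `budget`."""
    topic_words = set(topic.lower().split())
    def score(p):
        return len(topic_words & set(p.lower().split()))
    return sorted(paragraphs, key=score, reverse=True)[:budget]

sources = [
    "The Transformer architecture was introduced in 2017.",
    "Bananas are rich in potassium.",
    "Transformer models dominate machine translation benchmarks.",
]
print(extract_salient(sources, "Transformer machine translation"))
```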
End-to-end text-dependent speaker verification
In this paper we present a data-driven, integrated approach to speaker verification, which maps a test utterance and a few reference utterances directly to a single score for verification and jointly …
  • Citations: 313 • Highly influential: 32 • PDF
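The end-to-end idea is to embed utterances and score a test utterance against the average of a few enrollment embeddings, for example by cosine similarity, with the whole pipeline trained for that score. A minimal sketch of the scoring step (the random vectors stand in for a learned network's embeddings; names are illustrative):

```python
import numpy as np

def verification_score(test_emb, enroll_embs):
    """Cosine similarity between a test embedding and the mean of the
    speaker's enrollment embeddings; higher means 'same speaker'."""
    speaker_model = np.mean(enroll_embs, axis=0)
    return (test_emb @ speaker_model) / (
        np.linalg.norm(test_emb) * np.linalg.norm(speaker_model))

rng = np.random.default_rng(0)
enroll = rng.normal(size=(3, 16))          # a few reference utterances
same = enroll.mean(axis=0) + 0.1 * rng.normal(size=16)
other = rng.normal(size=16)
print(verification_score(same, enroll))    # high score
print(verification_score(other, enroll))   # near zero on average
```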
Tensor2Tensor for Neural Machine Translation
Tensor2Tensor is a library for deep learning models that is well-suited for neural machine translation and includes the reference implementation of the state-of-the-art Transformer model.
  • Citations: 242 • Highly influential: 25 • PDF
Music Transformer: Generating Music with Long-Term Structure
Music relies heavily on repetition to build structure and meaning. Self-reference occurs on multiple timescales, from motifs to phrases to reusing of entire sections of music, such as in pieces with ABA structure. …
  • Citations: 79 • Highly influential: 16
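The Music Transformer's key ingredient is relative attention, where the attention logits get an extra term based on the distance between positions, which helps the model capture repetition across timescales. A minimal NumPy sketch of distance-based logits, in the style of Shaw et al.'s relative attention rather than the paper's memory-efficient formulation (names are illustrative):

```python
import numpy as np

def relative_attention(Q, K, V, rel_emb):
    """Attention whose logits add a learned embedding per relative distance.

    rel_emb: (2n-1, d) embeddings for distances -(n-1) .. n-1.
    """
    n, d = Q.shape
    # Content-based logits plus a term for each query/key distance.
    logits = Q @ K.T / np.sqrt(d)
    dist = np.arange(n)[None, :] - np.arange(n)[:, None] + (n - 1)  # 0..2n-2
    logits += np.einsum("qd,qkd->qk", Q, rel_emb[dist]) / np.sqrt(d)
    w = np.exp(logits - logits.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)                                   # softmax
    return w @ V

rng = np.random.default_rng(0)
n, d = 5, 8
out = relative_attention(*rng.normal(size=(3, n, d)), rng.normal(size=(2 * n - 1, d)))
print(out.shape)  # (5, 8)
```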
Swivel: Improving Embeddings by Noticing What's Missing
We present Submatrix-wise Vector Embedding Learner (Swivel), a method for generating low-dimensional feature embeddings from a feature co-occurrence matrix. Swivel performs approximate factorization …
  • Citations: 60 • Highly influential: 15 • PDF
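The underlying idea, factorizing a co-occurrence statistic into low-dimensional row and column embeddings, can be sketched as plain gradient descent on a squared-error objective. Swivel itself shards the matrix into submatrices and uses a piecewise loss for unobserved pairs, both of which this toy version omits (all names are illustrative):

```python
import numpy as np

def factorize(M, dim=4, lr=0.05, steps=2000, seed=0):
    """Fit row/column embeddings so that W @ C.T approximates M."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = M.shape
    W = 0.1 * rng.normal(size=(n_rows, dim))   # row-feature embeddings
    C = 0.1 * rng.normal(size=(n_cols, dim))   # column-feature embeddings
    for _ in range(steps):
        err = W @ C.T - M                      # reconstruction error
        W, C = W - lr * err @ C, C - lr * err.T @ W
    return W, C

# Toy PMI-like matrix with obvious low-rank structure.
M = np.outer([1.0, 2.0, 0.5], [1.0, 0.0, 1.5, 0.5])
W, C = factorize(M)
print(np.round(W @ C.T - M, 3))                # near-zero residuals
```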