• Publications
  • Influence
Attention is All you Need
A new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely is proposed, which generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data. Expand
Relational inductive biases, deep learning, and graph networks
It is argued that combinatorial generalization must be a top priority for AI to achieve human-like abilities, and that structured representations and computations are key to realizing this objective. Expand
Self-Attention with Relative Position Representations
This work presents an alternative approach, extending the self-attention mechanism to efficiently consider representations of the relative positions, or distances between sequence elements, on the WMT 2014 English-to-German and English- to-French translation tasks. Expand
Learning Whom to Trust with MACE
MACE (Multi-Annotator Competence Estimation) learns in an unsupervised fashion to identify which annotators are trustworthy and predict the correct underlying labels, and shows considerable improvements over standard baselines, both for predicted label accuracy and trustworthiness estimates. Expand
Decoding with Large-Scale Neural Language Models Improves Translation
This work develops a new model that combines the neural probabilistic language model of Bengio et al., rectified linear units, and noise-contrastive estimation, and incorporates it into a machine translation system both by reranking k-best lists and by direct integration into the decoder. Expand
Tensor2Tensor for Neural Machine Translation
Tensor2Tensor is a library for deep learning models that is well-suited for neural machine translation and includes the reference implementation of the state-of-the-art Transformer model.
Attention Augmented Convolutional Networks
It is found that Attention Augmentation leads to consistent improvements in image classification on ImageNet and object detection on COCO across many different models and scales, including ResNets and a state-of-the art mobile constrained network, while keeping the number of parameters similar. Expand
Music Transformer: Generating Music with Long-Term Structure
It is demonstrated that a Transformer with the modified relative attention mechanism can generate minute-long compositions with compelling structure, generate continuations that coherently elaborate on a given motif, and in a seq2seq setup generate accompaniments conditioned on melodies. Expand
Image Transformer
This work generalizes a recently proposed model architecture based on self-attention, the Transformer, to a sequence modeling formulation of image generation with a tractable likelihood, and significantly increases the size of images the model can process in practice, despite maintaining significantly larger receptive fields per layer than typical convolutional neural networks. Expand
Stand-Alone Self-Attention in Vision Models
The results establish that stand-alone self-attention is an important addition to the vision practitioner's toolbox and is especially impactful when used in later layers. Expand