Publications
Using the Output Embedding to Improve Language Models
TLDR: The topmost weight matrix of neural network language models is studied; it is shown that this matrix constitutes a valid word embedding, and a new method of regularizing the output embedding is offered.
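The central technique here is tying the output projection to the input embedding. A minimal sketch of weight tying in PyTorch, not the paper's implementation (the `TiedLM` name, the LSTM backbone, and the sizes are placeholders):

```python
import torch.nn as nn

class TiedLM(nn.Module):
    """Minimal language model whose output projection reuses the input embedding."""
    def __init__(self, vocab_size=10000, dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, vocab_size, bias=False)
        # Weight tying: the output (decoder) matrix is the same tensor as the
        # input embedding matrix, so gradients from both roles train one matrix.
        self.out.weight = self.embed.weight

    def forward(self, tokens):
        hidden, _ = self.rnn(self.embed(tokens))
        return self.out(hidden)  # logits over the vocabulary
```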
Shortformer: Better Language Modeling using Shorter Inputs
TLDR: This work shows that initially training the model on short subsequences before moving on to longer ones both reduces overall training time and substantially improves perplexity on WikiText-103, without adding any parameters.
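As described, training is staged by input length. A rough sketch of such a schedule, assuming a hypothetical `train_epoch(model, data, seq_len)` helper; the lengths and epoch counts are made up for illustration:

```python
# Hypothetical two-stage length schedule: short subsequences first, then long ones.
STAGES = [
    {"seq_len": 128, "epochs": 50},   # stage 1: short inputs, cheaper steps
    {"seq_len": 3072, "epochs": 50},  # stage 2: long inputs, continuing from stage 1
]

def staged_training(model, train_data, train_epoch):
    # `train_epoch` is a placeholder for the usual language-model epoch loop.
    for stage in STAGES:
        for _ in range(stage["epochs"]):
            train_epoch(model, train_data, seq_len=stage["seq_len"])
```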
Language Generation with Recurrent Generative Adversarial Networks without Pre-training
TLDR: It is shown that recurrent neural networks can be trained to generate text with GANs from scratch by slowly teaching the model to generate sequences of increasing and variable length, which vastly improves the quality of generated sequences compared to a convolutional baseline.
Improving Transformer Models by Reordering their Sublayers
TLDR: This work proposes a new ordering of transformer sublayers, the sandwich transformer, which places more self-attention sublayers towards the bottom of the model and more feedforward sublayers towards the top, and shows that it improves perplexity on multiple word-level and character-level language modeling benchmarks, at no cost in parameters, memory, or training time.
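The reordering can be written as a string over 's' (self-attention) and 'f' (feedforward) sublayers. A sketch of building such a stack, with placeholder dimensions; this illustrates the ordering idea rather than reproducing the paper's code:

```python
import torch.nn as nn

def sandwich_pattern(n=16, k=6):
    # k extra self-attention sublayers at the bottom, k extra feedforward sublayers
    # at the top, interleaved pairs in between: s^k (sf)^(n-k) f^k.
    return "s" * k + "sf" * (n - k) + "f" * k

class SublayerStack(nn.Module):
    """Pre-norm residual stack whose sublayer order follows a pattern string."""
    def __init__(self, dim=512, heads=8, ffn_dim=2048, pattern=None):
        super().__init__()
        self.pattern = pattern or sandwich_pattern()
        self.layers = nn.ModuleList()
        for kind in self.pattern:
            if kind == "s":
                self.layers.append(nn.MultiheadAttention(dim, heads, batch_first=True))
            else:
                self.layers.append(nn.Sequential(
                    nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim)))
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in self.pattern)

    def forward(self, x):
        for kind, layer, norm in zip(self.pattern, self.layers, self.norms):
            h = norm(x)
            h = layer(h, h, h)[0] if kind == "s" else layer(h)
            x = x + h  # residual connection around every sublayer
        return x
```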
Partially Shuffling the Training Data to Improve Language Models
TLDR: This paper presents a method that partially shuffles the training data between epochs, making each batch random while keeping most of the sentence ordering intact, and achieves new state-of-the-art results in word-level language modeling on both the Penn Treebank and WikiText-2 datasets.
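One way to picture a partial shuffle, under the usual setup where the corpus is concatenated and split into parallel token streams for batching: rotate each stream by an independent random offset at the start of every epoch. A sketch of that idea (the layout and names are assumptions, not the paper's code):

```python
import torch

def partial_shuffle(streams):
    # `streams` is a (num_streams, tokens_per_stream) tensor of token ids.
    # Rolling each stream by a random offset changes what lands in each batch
    # across epochs while keeping almost all local token ordering intact.
    rotated = []
    for stream in streams:
        shift = int(torch.randint(0, stream.size(0), (1,)))
        rotated.append(torch.roll(stream, shifts=shift, dims=0))
    return torch.stack(rotated)
```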
You May Not Need Attention
TLDR: A recurrent neural translation model that does not use attention and does not have a separate encoder and decoder is introduced; it performs on par with the standard attention-based model of Bahdanau et al. (2014), and better on long sentences.
Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation
TLDR: A simple and efficient method, Attention with Linear Biases (ALiBi), enables input length extrapolation; its inductive bias towards recency allows it to outperform multiple strong position methods on the WikiText-103 benchmark.
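ALiBi replaces learned position embeddings with a per-head linear penalty on attention scores that grows with query-key distance. A minimal sketch of computing those biases (the slope schedule shown is the geometric sequence for a power-of-two head count; shapes and names are illustrative):

```python
import torch

def alibi_bias(num_heads, seq_len):
    # Head h gets slope m_h = 2 ** (-8 * (h + 1) / num_heads); the bias for query i
    # attending to key j (j <= i) is -m_h * (i - j), penalizing distant keys more.
    start = 2.0 ** (-8.0 / num_heads)
    slopes = torch.tensor([start ** (h + 1) for h in range(num_heads)])
    distance = torch.arange(seq_len).view(1, -1) - torch.arange(seq_len).view(-1, 1)
    distance = distance.clamp(max=0)  # future positions are handled by the causal mask
    return slopes.view(num_heads, 1, 1) * distance  # shape: (heads, seq_len, seq_len)

# Usage sketch: scores = q @ k.transpose(-1, -2) / d ** 0.5 + alibi_bias(heads, n)
```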