Publications
Regularizing and Optimizing LSTM Language Models
TLDR
We propose the weight-dropped LSTM, which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization (a sketch follows below).
Citations: 687 · Influence: 162
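The core mechanism is small enough to sketch: DropConnect zeroes individual entries of the recurrent (hidden-to-hidden) weight matrix rather than activations. The PyTorch cell below is a minimal illustration of that idea, not the authors' AWD-LSTM implementation; the class name, initialization, and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightDroppedLSTM(nn.Module):
    """Minimal LSTM with DropConnect on the hidden-to-hidden weights (sketch)."""

    def __init__(self, input_size, hidden_size, weight_drop=0.5):
        super().__init__()
        self.hidden_size = hidden_size
        self.weight_drop = weight_drop
        self.w_ih = nn.Parameter(torch.randn(4 * hidden_size, input_size) * 0.1)
        self.w_hh = nn.Parameter(torch.randn(4 * hidden_size, hidden_size) * 0.1)
        self.bias = nn.Parameter(torch.zeros(4 * hidden_size))

    def forward(self, x_seq, state):
        # Sample one DropConnect mask per forward pass, so the same dropped
        # recurrent weights are reused at every time step of the sequence.
        w_hh = F.dropout(self.w_hh, p=self.weight_drop, training=self.training)
        h, c = state
        outputs = []
        for x in x_seq.unbind(dim=1):  # x_seq: (batch, time, input_size)
            gates = x @ self.w_ih.t() + h @ w_hh.t() + self.bias
            i, f, g, o = gates.chunk(4, dim=-1)
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
            h = torch.sigmoid(o) * torch.tanh(c)
            outputs.append(h)
        return torch.stack(outputs, dim=1), (h, c)
```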
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
TLDR
We present numerical evidence that large-batch methods tend to converge to sharp minimizers of the training and testing functions; as is well known, sharp minima lead to poorer generalization (a sketch of a 1-D loss slice follows below).
Citations: 1,196 · Influence: 130
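One way the flat-versus-sharp distinction can be visualized is by evaluating the loss along the straight line between a small-batch solution and a large-batch solution. The helper below is a hedged sketch of that 1-D slice; the function name and the two pre-trained models it expects are assumptions.

```python
import copy
import torch

def loss_along_segment(model_sb, model_lb, loss_fn, batch, num_points=25):
    """Loss on the line (1 - alpha) * theta_SB + alpha * theta_LB (sketch)."""
    params_sb = [p.detach().clone() for p in model_sb.parameters()]
    params_lb = [p.detach().clone() for p in model_lb.parameters()]
    probe = copy.deepcopy(model_sb)          # scratch model for interpolated weights
    inputs, targets = batch
    curve = []
    with torch.no_grad():
        for alpha in torch.linspace(-0.5, 1.5, num_points):
            for p, p_sb, p_lb in zip(probe.parameters(), params_sb, params_lb):
                p.copy_((1 - alpha) * p_sb + alpha * p_lb)
            curve.append((alpha.item(), loss_fn(probe(inputs), targets).item()))
    return curve  # a sharp minimizer shows a much narrower basin along this slice
```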
CTRL: A Conditional Transformer Language Model for Controllable Generation
TLDR
We release CTRL, a 1.63 billion-parameter conditional transformer language model, trained to condition on control codes that govern style, content, and task-specific behavior (a sketch of the conditioning scheme follows below).
Citations: 233 · Influence: 49
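Mechanically, conditioning on a control code amounts to prepending the code to the token sequence, so the model learns p(text | code). The snippet below is a toy illustration of that framing; the tokenizer and helper name are placeholders rather than CTRL's actual interface.

```python
def build_input(control_code: str, text: str, tokenize) -> list[int]:
    """Prepend the control code so generation is conditioned on it (sketch)."""
    return tokenize(control_code) + tokenize(text)

# Example: the same prompt steered toward different styles by the leading code.
# build_input("Reviews", "The soundtrack was", tokenize)
# build_input("Horror", "The soundtrack was", tokenize)
```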
The Natural Language Decathlon: Multitask Learning as Question Answering
TLDR
We introduce the Natural Language Decathlon (decaNLP), a new benchmark for measuring the performance of NLP models across ten tasks that appear disparate until unified as question answering (a sketch of this framing follows below).
Citations: 257 · Influence: 30
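The unifying trick is to cast every task as a (question, context, answer) triple so that one question answering model can handle all ten. The examples below are illustrative paraphrases of that framing, not items from the actual benchmark.

```python
# Disparate tasks expressed in a shared question answering format (illustrative).
examples = [
    {"question": "Is this sentence positive or negative?",
     "context": "The film was a delight from start to finish.",
     "answer": "positive"},
    {"question": "What is the translation from English to German?",
     "context": "The house is small.",
     "answer": "Das Haus ist klein."},
    {"question": "What is the summary?",
     "context": "A long news article about a local election ...",
     "answer": "A short summary of the article."},
]
```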
An Analysis of Neural Language Modeling at Multiple Scales
TLDR
We take existing state-of-the-art word-level language models based on LSTMs and QRNNs and extend them to both larger vocabularies and character-level granularity.
Citations: 132 · Influence: 25
Neural Text Summarization: A Critical Evaluation
TLDR
We critically evaluate key ingredients of the current research setup (datasets, evaluation metrics, and models) and highlight three primary shortcomings, the first being that automatically collected datasets leave the task underconstrained and may contain noise detrimental to training and evaluation.
Citations: 77 · Influence: 22
Improving Generalization Performance by Switching from Adam to SGD
TLDR
We investigate a hybrid strategy that begins training with an adaptive method and switches to SGD when appropriate (a sketch follows below).
Citations: 217 · Influence: 20
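A hedged sketch of the hybrid strategy in PyTorch: train with Adam, then hand the same parameters to SGD partway through. The fixed switch epoch below is a simplification (the paper decides when and how to switch based on the adaptive updates themselves); the function name and hyperparameters are assumptions.

```python
import torch

def train_with_switch(model, loss_fn, loader, epochs=30, switch_epoch=10,
                      adam_lr=1e-3, sgd_lr=0.1):
    """Begin with Adam, switch the optimizer to SGD mid-training (sketch)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=adam_lr)
    for epoch in range(epochs):
        if epoch == switch_epoch:
            # Same weights, different update rule from this point on.
            optimizer = torch.optim.SGD(model.parameters(), lr=sgd_lr, momentum=0.9)
        for inputs, targets in loader:
            optimizer.zero_grad()
            loss_fn(model(inputs), targets).backward()
            optimizer.step()
    return model
```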
Coarse-grain Fine-grain Coattention Network for Multi-evidence Question Answering
TLDR
We propose the Coarse-grain Fine-grain Coattention Network (CFC), a new question answering model that combines information from evidence across multiple documents (a sketch of the coattention operation follows below).
Citations: 35 · Influence: 6
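At the heart of the model is coattention, in which context and question representations attend to each other through a shared affinity matrix. The function below sketches that generic operation only, not the full coarse-grain/fine-grain architecture; names and shapes are assumptions.

```python
import torch
import torch.nn.functional as F

def coattention(context, question):
    """Context (batch, n_c, d) and question (batch, n_q, d) attend to each
    other via one shared affinity matrix (sketch of the generic operation)."""
    affinity = context @ question.transpose(1, 2)          # (batch, n_c, n_q)
    ctx_to_q = F.softmax(affinity, dim=-1)                 # context positions over question
    q_to_ctx = F.softmax(affinity, dim=1)                  # question positions over context
    question_summary = ctx_to_q @ question                 # (batch, n_c, d)
    context_summary = q_to_ctx.transpose(1, 2) @ context   # (batch, n_q, d)
    return question_summary, context_summary
```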
Weighted Transformer Network for Machine Translation
TLDR
We propose the Weighted Transformer, a Transformer with modified attention layers that not only outperforms the baseline network in BLEU score but also converges 15-40% faster (a sketch of the branch-weighting idea follows below).
Citations: 83 · Influence: 5
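The modification can be summarized as combining parallel attention branches with learned, normalized weights instead of a plain fixed combination. The module below is a hedged sketch of that weighting idea only; the branch internals and the class name are assumptions, not the paper's layer definition.

```python
import torch
import torch.nn as nn

class WeightedBranches(nn.Module):
    """Combine parallel branches with learned weights that sum to one (sketch)."""

    def __init__(self, branches):
        super().__init__()
        self.branches = nn.ModuleList(branches)
        self.logits = nn.Parameter(torch.zeros(len(branches)))

    def forward(self, x):
        weights = torch.softmax(self.logits, dim=0)   # learned, normalized weights
        return sum(w * branch(x) for w, branch in zip(weights, self.branches))
```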
Balancing Communication and Computation in Distributed Optimization
TLDR
We present a flexible algorithmic framework in which communication and computation steps are explicitly decomposed, enabling algorithm customization for various applications (a toy illustration follows below).
Citations: 42 · Influence: 5
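The flexibility being described is the ability to tune how much local computation happens between communication rounds. The toy loop below illustrates that trade-off with simulated workers that take several local gradient steps and then average once; it illustrates the decomposition idea only and is not the paper's framework.

```python
import numpy as np

def local_steps_then_average(worker_grads, x0, local_steps=5, rounds=20, lr=0.1):
    """Several computation steps per worker, then one communication (averaging)
    step per round: a toy knob for balancing computation against communication."""
    xs = [x0.copy() for _ in worker_grads]
    for _ in range(rounds):
        for i, grad in enumerate(worker_grads):   # local computation
            for _ in range(local_steps):
                xs[i] = xs[i] - lr * grad(xs[i])
        avg = np.mean(xs, axis=0)                 # communication: average iterates
        xs = [avg.copy() for _ in xs]
    return avg
```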