• Publications
Regularizing and Optimizing LSTM Language Models
This paper proposes the weight-dropped LSTM, which uses DropConnect on hidden-to-hidden weights as a form of recurrent regularization, and introduces NT-ASGD, a variant of the averaged stochastic gradient method in which the averaging trigger is determined by a non-monotonic condition rather than tuned by the user.
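The non-monotonic trigger mentioned in the summary above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `nt_asgd_trigger`, the argument `val_losses` (one validation loss per evaluation), and the window size `nonmono` are names assumed here. Averaging is taken to start at the first evaluation whose validation loss is worse than the best loss seen more than `nonmono` evaluations earlier.

```python
def nt_asgd_trigger(val_losses, nonmono=5):
    """Return the index of the evaluation at which averaging would start.

    val_losses: validation losses, one per evaluation (assumed input).
    nonmono: non-monotone interval, i.e. how many recent evaluations
             are ignored when looking up the best earlier loss.
    Returns None if the condition is never met.
    """
    logs = []  # validation losses seen before the current evaluation
    for t, v in enumerate(val_losses):
        # Compare against the best loss recorded at least `nonmono`
        # evaluations ago; t > nonmono guarantees the slice is non-empty.
        if t > nonmono and v > min(logs[: t - nonmono]):
            return t
        logs.append(v)
    return None


if __name__ == "__main__":
    losses = [5.0, 4.0, 3.0, 2.9, 2.8, 2.85, 2.9, 2.95, 3.0]
    print(nt_asgd_trigger(losses, nonmono=2))
```

In a training loop, the returned index would mark the point from which iterates are averaged; until then, plain SGD steps are taken.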
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
This work investigates the cause of the generalization drop in the large-batch regime and presents numerical evidence supporting the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions; as is well known, sharp minima lead to poorer generalization.
CTRL: A Conditional Transformer Language Model for Controllable Generation
This work releases CTRL, a 1.63 billion-parameter conditional transformer language model trained to condition on control codes that govern style, content, and task-specific behavior, providing more explicit control over text generation.
The Natural Language Decathlon: Multitask Learning as Question Answering
Presented on August 28, 2018 at 12:15 p.m. in the Pettit Microelectronics Research Center, Room 102 A/B.
Neural Text Summarization: A Critical Evaluation
This work critically evaluates key ingredients of the current research setup (datasets, evaluation metrics, and models) and highlights three primary shortcomings, among them that automatically collected datasets leave the task underconstrained and may contain noise detrimental to training and evaluation.
Improving Generalization Performance by Switching from Adam to SGD
SWATS is a hybrid strategy that begins training with an adaptive method and switches to SGD when appropriate; it is capable of closing the generalization gap between SGD and Adam on a majority of the tasks.
An Analysis of Neural Language Modeling at Multiple Scales
This work takes existing state-of-the-art word-level language models based on LSTMs and QRNNs and extends them to both larger vocabularies and character-level granularity, achieving state-of-the-art results on character-level and word-level datasets.
GeDi: Generative Discriminator Guided Sequence Generation
GeDi is proposed as an efficient method for using smaller LMs as generative discriminators to guide generation from large LMs, making them safer and more controllable. GeDi is found to give stronger controllability than the state-of-the-art method while also achieving generation speeds more than 30 times faster.
Coarse-grain Fine-grain Coattention Network for Multi-evidence Question Answering
This work presents the Coarse-grain Fine-grain Coattention Network (CFC), a new question answering model that combines information from evidence across multiple documents and obtains a new state-of-the-art result on the Qangaroo WikiHop multi-evidence question answering task.
Balancing Communication and Computation in Distributed Optimization
This paper proposes an adaptive cost framework that adjusts the cost measure depending on the features of various applications, and presents a flexible algorithmic framework, where communication and computation steps are explicitly decomposed to enable algorithm customization for various applications.