Publications
Adam: A Method for Stochastic Optimization
We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments. The method is straightforward to …
  • 46,811 citations
  • 8,373 highly influential citations
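A minimal NumPy sketch of the update rule the abstract describes, using the paper's default hyperparameters (alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8); the function name and explicit state-threading here are illustrative, not the reference implementation:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m, v are running moment estimates, t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad        # biased first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2   # biased second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)              # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)              # bias-corrected second moment
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```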
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
Inspired by recent work in machine translation and object detection, we introduce an attention based model that automatically learns to describe the content of images. We describe how we can train …
  • 4,935 citations
  • 492 highly influential citations
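For illustration, a NumPy sketch of the soft attention step such a caption model uses: a learned scorer weights the image's annotation vectors, and the context vector is their weighted average. The single-hidden-layer scorer (`W_a`, `W_h`, `w`) is an assumed parameterization, not the paper's exact network:

```python
import numpy as np

def soft_attention(annotations, h_prev, W_a, W_h, w):
    # annotations: (L, D) feature vectors, one per image region
    # h_prev: (H,) previous decoder LSTM hidden state
    scores = np.tanh(annotations @ W_a + h_prev @ W_h) @ w  # (L,) relevance scores
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                                    # softmax over regions
    context = alpha @ annotations                           # (D,) expected context vector
    return context, alpha
```

Because the context is a differentiable expectation over regions, this "deterministic" variant trains with standard backpropagation.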
Do Deep Nets Really Need to be Deep?
Currently, deep neural networks are the state of the art on problems such as speech recognition and computer vision. In this paper we empirically demonstrate that shallow feed-forward nets can learn …
  • 1,010 citations
  • 82 highly influential citations
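The key training device here is to have the shallow "student" net regress on the deep "teacher" net's un-normalized log-probabilities (logits) rather than the original 0/1 labels; a minimal sketch of that mimic loss (the 0.5/mean scaling is a conventional choice, not taken from the paper):

```python
import numpy as np

def mimic_loss(student_logits, teacher_logits):
    # L2 regression on the teacher's logits: richer targets than hard labels,
    # since relative logit magnitudes expose the function the teacher learned.
    return 0.5 * np.mean((student_logits - teacher_logits) ** 2)
```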
Layer Normalization
Training state-of-the-art, deep neural networks is computationally expensive. One way to reduce the training time is to normalize the activities of the neurons. A recently introduced technique called batch normalization …
  • 1,130 citations
  • 72 highly influential citations
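A minimal NumPy sketch of the normalization for a single training case: statistics are computed across the hidden units of one layer, so the method, unlike batch normalization, does not depend on the mini-batch (`gain`, `bias`, and `eps` follow common usage and are assumptions here):

```python
import numpy as np

def layer_norm(a, gain, bias, eps=1e-5):
    # a: (H,) summed inputs to the H units of one layer for ONE training case
    mu = a.mean()                 # per-case mean over hidden units
    sigma = a.std()               # per-case std over hidden units
    return gain * (a - mu) / (sigma + eps) + bias
```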
Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation
In this work, we propose to apply trust region optimization to deep reinforcement learning using a recently proposed Kronecker-factored approximation to the curvature. We extend the framework of …
  • 277 citations
  • 47 highly influential citations
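One piece that can be sketched in isolation is the trust-region scaling of a natural-gradient update: the step is shrunk so the locally predicted KL change stays within a bound. This sketch assumes the Kronecker-factored natural gradient `nat_grad` = F⁻¹g has already been computed; the names and the quadratic KL approximation are illustrative:

```python
import numpy as np

def trust_region_step(grad, nat_grad, max_kl, lr_max):
    # Predicted KL change of an update eta * nat_grad is ~ 0.5 * eta^2 * g^T F^-1 g.
    quad = float(np.dot(grad, nat_grad))                    # g^T F^-1 g
    eta = min(lr_max, np.sqrt(2.0 * max_kl / (quad + 1e-8)))
    return eta * nat_grad                                   # scaled natural-gradient update
```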
Multiple Object Recognition with Visual Attention
We present an attention-based model for recognizing multiple objects in images. The proposed model is a deep recurrent neural network trained with reinforcement learning to attend to the most …
  • 657 citations
  • 36 highly influential citations
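Because choosing discrete attention (glimpse) locations is not differentiable, such models are trained with a score-function (REINFORCE) estimator; below is a generic sketch of that estimator, not the paper's full training loop:

```python
import numpy as np

def reinforce_gradient(logprob_grads, returns, baseline):
    # logprob_grads: list of d/dtheta log pi(a_t | s_t), one array per time step
    # Scale each score-function gradient by the centered return (baseline
    # subtraction reduces variance without biasing the estimator).
    return sum((R - baseline) * g for g, R in zip(logprob_grads, returns))
```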
Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning
The ability to act in multiple environments and transfer previous knowledge to new situations can be considered a critical aspect of any intelligent agent. Towards this goal, we define a novel method …
  • 293 citations
  • 33 highly influential citations
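The policy-regression part of such an objective can be sketched compactly: each expert DQN's Q-values are converted into a soft expert policy with a softmax at temperature `tau`, and the single multitask "mimic" network is trained with cross-entropy against it (names and the temperature value are illustrative):

```python
import numpy as np

def actor_mimic_policy_loss(mimic_logits, expert_q, tau=0.1):
    # Soft expert policy: softmax over the expert's Q-values at temperature tau.
    e = np.exp(expert_q / tau - np.max(expert_q / tau))
    expert_pi = e / e.sum()
    # Cross-entropy between the expert policy and the mimic network's policy.
    m = mimic_logits - np.max(mimic_logits)
    log_mimic_pi = m - np.log(np.exp(m).sum())
    return -np.sum(expert_pi * log_mimic_pi)
```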
Adaptive dropout for training deep neural networks
Recently, it was shown that deep neural networks can perform very well if the activities of hidden units are regularized during learning, e.g., by randomly dropping out 50% of their activities. We …
  • 197 citations
  • 17 highly influential citations
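A minimal sketch of the adaptive-dropout ("standout") idea: rather than dropping units at a fixed 50% rate, a sigmoidal belief network computes a per-unit keep probability. Here it shares the layer's own weights `W`, scaled by `alpha` and shifted by `beta`; that weight sharing is one variant and an assumption in this sketch:

```python
import numpy as np

def standout_mask(x, W, alpha=1.0, beta=0.0, rng=None):
    rng = rng or np.random.default_rng()
    # Per-unit keep probability from an overlaid sigmoid belief network.
    keep_prob = 1.0 / (1.0 + np.exp(-(alpha * (x @ W) + beta)))
    return (rng.random(keep_prob.shape) < keep_prob).astype(x.dtype)  # 1 = keep unit
```

During training the mask multiplies the hidden activities elementwise, so units the belief network deems important are dropped less often.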
Lookahead Optimizer: k steps forward, 1 step back
The vast majority of successful deep neural networks are trained using variants of stochastic gradient descent (SGD) algorithms. Recent attempts to improve SGD can be broadly categorized into two …
  • 90 citations
  • 17 highly influential citations
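The title is the algorithm: an inner optimizer takes k "fast" steps, then the "slow" weights take one interpolation step toward the result. A minimal sketch, with `inner_step` standing in for one update of any inner optimizer such as SGD (an assumed callable):

```python
import numpy as np

def lookahead(theta0, inner_step, k=5, alpha=0.5, outer_steps=100):
    slow = np.array(theta0, dtype=float)
    for _ in range(outer_steps):
        fast = slow.copy()
        for _ in range(k):                    # k steps forward with the inner optimizer
            fast = inner_step(fast)
        slow = slow + alpha * (fast - slow)   # 1 step back: interpolate slow -> fast
    return slow
```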
Using Fast Weights to Attend to the Recent Past
Until recently, research on artificial neural networks was largely restricted to systems with only two types of variable: Neural activities that represent the current or recent input and weights that …
  • 106 citations
  • 16 highly influential citations
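The third kind of variable the abstract builds toward is a "fast" weight matrix A: a rapidly decaying outer-product memory of recent hidden states that inner settling iterations attend to. A minimal sketch of one recurrent step, omitting the layer normalization the paper applies (names and the single settling loop are illustrative):

```python
import numpy as np

def fast_weights_step(A, h, x, W, C, lam=0.95, eta=0.5, inner_iters=1):
    A = lam * A + eta * np.outer(h, h)   # fast weights: decaying Hebbian memory of recent states
    pre = W @ h + C @ x                  # "slow" recurrent + input drive
    h_next = np.tanh(pre)
    for _ in range(inner_iters):         # settle: attend to the recent past stored in A
        h_next = np.tanh(pre + A @ h_next)
    return A, h_next
```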