Publications
On the importance of initialization and momentum in deep learning
TLDR
We show that when stochastic gradient descent with momentum uses a well-designed random initialization and a particular type of slowly increasing schedule for the momentum parameter, it can train both DNNs and RNNs (on datasets with long-term dependencies) to levels of performance that were previously achievable only with Hessian-Free optimization. A minimal sketch of such a momentum schedule appears below.
  • Citations: 2,670 · Influence: 293
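As a rough illustration, here is a hedged sketch of SGD with Nesterov momentum and a slowly increasing momentum schedule plus a small scaled random initialization. The schedule constants, the toy quadratic objective, and the learning rate are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch: Nesterov momentum with a momentum coefficient that is
# annealed upward over training (illustrative constants, not the paper's).
import jax
import jax.numpy as jnp

def momentum_schedule(t, mu_max=0.99):
    # Slowly increase from 0.5 toward mu_max as training progresses.
    return jnp.minimum(1.0 - 0.5 / (1.0 + t / 250.0), mu_max)

def loss(w):
    # Toy ill-conditioned quadratic standing in for a deep-network loss.
    curv = jnp.array([1.0, 0.01])
    return 0.5 * jnp.sum(curv * w ** 2)

grad = jax.grad(loss)

key = jax.random.PRNGKey(0)
w = 0.1 * jax.random.normal(key, (2,))   # small scaled random initialization
v = jnp.zeros_like(w)
lr = 0.1
for t in range(500):
    mu = momentum_schedule(t)
    g = grad(w + mu * v)                 # Nesterov: gradient at the look-ahead point
    v = mu * v - lr * g
    w = w + v
print(loss(w))
```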
Deep learning via Hessian-free optimization
TLDR
We develop a second-order optimization method based on the "Hessian-free" approach, and apply it to training deep auto-encoders. A sketch of the core Hessian-vector-product-plus-CG machinery appears below.
  • Citations: 720 · Influence: 94
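The following is a hedged sketch of the core Hessian-free idea: never form the curvature matrix, only use matrix-vector products with it inside conjugate gradient to approximately solve for an update direction. The toy objective and damping value are illustrative assumptions; the paper uses Gauss-Newton curvature-vector products and several additional heuristics.

```python
# Hessian-vector products via JAX forward-over-reverse differentiation,
# consumed by a plain conjugate-gradient inner solver.
import jax
import jax.numpy as jnp

def loss(w):
    return jnp.sum(jnp.cos(w)) + 0.5 * jnp.sum(w ** 2)   # stand-in objective

def hvp(w, v, damping=1e-2):
    # Hessian-vector product H v plus Tikhonov damping, without forming H.
    return jax.jvp(jax.grad(loss), (w,), (v,))[1] + damping * v

def conjugate_gradient(w, b, iters=50):
    # Approximately solve (H + damping*I) d = b with CG.
    d = jnp.zeros_like(b)
    r = b - hvp(w, d)
    p = r
    rs = jnp.dot(r, r)
    for _ in range(iters):
        Hp = hvp(w, p)
        alpha = rs / jnp.dot(p, Hp)
        d = d + alpha * p
        r = r - alpha * Hp
        rs_new = jnp.dot(r, r)
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d

w = jnp.array([1.0, -2.0, 0.5])
g = jax.grad(loss)(w)
step = conjugate_gradient(w, -g)   # Newton-like update direction
w = w + step
```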
Optimizing Neural Networks with Kronecker-factored Approximate Curvature
TLDR
We propose an efficient method for approximating natural gradient descent in neural networks which we call Kronecker-Factored Approximate Curvature (K-FAC). A single-layer sketch of the Kronecker-factored update appears below.
  • Citations: 380 · Influence: 79
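Here is a hedged sketch of the K-FAC idea for one fully connected layer: approximate that layer's Fisher block by a Kronecker product of the input-activation second moment A and the output-gradient second moment G, so the preconditioned step becomes G⁻¹ (dL/dW) A⁻¹. The shapes, random data, and damping value are illustrative assumptions; the full method adds many refinements.

```python
# One-layer K-FAC-style preconditioning with Kronecker factors A and G.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
k_a, k_g, k_w = jax.random.split(key, 3)
batch, d_in, d_out = 64, 10, 5

a = jax.random.normal(k_a, (batch, d_in))     # layer inputs (activations)
g = jax.random.normal(k_g, (batch, d_out))    # backpropagated output gradients
grad_W = g.T @ a / batch                      # ordinary gradient, shape (d_out, d_in)

damping = 1e-2
A = a.T @ a / batch + damping * jnp.eye(d_in)    # Kronecker factor from inputs
G = g.T @ g / batch + damping * jnp.eye(d_out)   # Kronecker factor from gradients

# Approximate natural-gradient update for this layer.
nat_grad_W = jnp.linalg.solve(G, grad_W) @ jnp.linalg.inv(A)
W = jax.random.normal(k_w, (d_out, d_in))
lr = 0.1
W = W - lr * nat_grad_W
```

Inverting the two small Kronecker factors is the design point: it replaces inverting one huge per-layer Fisher block with two cheap matrix inversions.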
Generating Text with Recurrent Neural Networks
TLDR
We demonstrate the power of RNNs trained with the new Hessian-Free optimizer (HF) by applying them to character-level language modeling tasks. A minimal character-level RNN step is sketched below.
  • Citations: 1,046 · Influence: 71
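For readers unfamiliar with the task, here is a hedged sketch of one step of character-level language modeling with a vanilla RNN cell: one-hot characters go in, a tanh hidden state is updated, and a softmax over the character vocabulary predicts the next character. The sizes and random parameters are illustrative; the paper actually trains a multiplicative RNN with HF.

```python
# Minimal character-level RNN step (illustrative toy, not the paper's model).
import jax
import jax.numpy as jnp

vocab = sorted(set("hello world"))
char_to_id = {c: i for i, c in enumerate(vocab)}
V, H = len(vocab), 16

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
W_xh = 0.1 * jax.random.normal(k1, (H, V))
W_hh = 0.1 * jax.random.normal(k2, (H, H))
W_hy = 0.1 * jax.random.normal(k3, (V, H))

def step(h, char_id):
    x = jax.nn.one_hot(char_id, V)
    h = jnp.tanh(W_xh @ x + W_hh @ h)
    logits = W_hy @ h
    return h, jax.nn.softmax(logits)   # distribution over the next character

h = jnp.zeros(H)
for c in "hello":
    h, next_char_probs = step(h, char_to_id[c])
```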
Learning Recurrent Neural Networks with Hessian-Free Optimization
TLDR
In this work we resolve the long-standing problem of how to effectively train recurrent neural networks (RNNs) on complex and difficult sequence modeling problems that may contain long-term data dependencies.
  • Citations: 535 · Influence: 52
The Mechanics of n-Player Differentiable Games
TLDR
We propose Symplectic Gradient Adjustment (SGA), a new algorithm for finding stable fixed points in general games. A sketch of the adjustment on a simple bilinear game appears below.
  • Citations: 128 · Influence: 33
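Below is a hedged sketch of the adjustment on a two-player zero-sum bilinear game (player 1 minimizes x·y, player 2 minimizes −x·y). Plain simultaneous gradient descent cycles around the fixed point at the origin; adding the antisymmetric adjustment pulls the iterates inward. The game, step size, and lambda are illustrative assumptions.

```python
# Symplectic Gradient Adjustment: adjust the simultaneous gradient by the
# antisymmetric part of its Jacobian.
import jax
import jax.numpy as jnp

def simultaneous_gradient(w):
    x, y = w
    # Each player's gradient of its own loss w.r.t. its own parameter:
    # L1 = x*y, L2 = -x*y.
    return jnp.array([y, -x])

def sga_direction(w, lam=1.0):
    xi = simultaneous_gradient(w)
    H = jax.jacobian(simultaneous_gradient)(w)
    A = 0.5 * (H - H.T)                # antisymmetric part of the Jacobian
    return xi + lam * A.T @ xi

w = jnp.array([1.0, 1.0])
lr = 0.05
for _ in range(200):
    w = w - lr * sga_direction(w)
print(w)   # approaches the stable fixed point at the origin
```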
New Insights and Perspectives on the Natural Gradient Method
  • J. Martens
  • Computer Science, Mathematics
  • J. Mach. Learn. Res.
  • 3 December 2014
TLDR
Natural gradient descent is an optimization method traditionally motivated from the perspective of information geometry that works well for many applications as an alternative to stochastic gradient descent. A small sketch of a Fisher-preconditioned update appears below.
  • Citations: 170 · Influence: 27
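As a concrete anchor, here is a hedged sketch of a natural-gradient-style update: precondition the mean gradient by the inverse of a Fisher estimate. For simplicity the sketch builds the *empirical* Fisher from per-example gradients of a toy least-squares model; the paper discusses at length how this differs from the true Fisher. The data, damping, and model are illustrative assumptions.

```python
# Empirical-Fisher-preconditioned update: theta <- theta - lr * F^{-1} g.
import jax
import jax.numpy as jnp

X = jnp.array([[1.0, 0.0], [0.0, 2.0], [1.0, 1.0], [2.0, 1.0]])
y = jnp.array([1.0, -1.0, 0.5, 2.0])

def per_example_loss(theta, x, t):
    return 0.5 * (x @ theta - t) ** 2

theta = jnp.zeros(2)
per_example_grads = jax.vmap(jax.grad(per_example_loss),
                             in_axes=(None, 0, 0))(theta, X, y)

mean_grad = per_example_grads.mean(axis=0)
damping = 1e-3
fisher = per_example_grads.T @ per_example_grads / X.shape[0] + damping * jnp.eye(2)

lr = 1.0
theta = theta - lr * jnp.linalg.solve(fisher, mean_grad)
```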
Adding Gradient Noise Improves Learning for Very Deep Networks
TLDR
In this paper, we explore the low-overhead and easy-to-implement optimization technique of adding annealed Gaussian noise to the gradient, which we find surprisingly effective when training very deep architectures. A sketch of the annealed-noise schedule appears below.
  • Citations: 298 · Influence: 24
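Here is a hedged sketch of the technique: at step t, add zero-mean Gaussian noise with variance eta / (1 + t)^gamma to the gradient before the SGD update, so the noise decays over training. The toy objective is a stand-in, and the particular eta and gamma values below should be treated as illustrative assumptions.

```python
# SGD with annealed Gaussian gradient noise.
import jax
import jax.numpy as jnp

def loss(w):
    return jnp.sum((w - 1.0) ** 2) + 0.1 * jnp.sum(jnp.sin(5.0 * w))

grad = jax.grad(loss)
key = jax.random.PRNGKey(0)
w = jnp.zeros(3)
lr, eta, gamma = 0.05, 0.3, 0.55

for t in range(1000):
    key, sub = jax.random.split(key)
    sigma = jnp.sqrt(eta / (1.0 + t) ** gamma)      # annealed noise scale
    noisy_grad = grad(w) + sigma * jax.random.normal(sub, w.shape)
    w = w - lr * noisy_grad
```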
Training Deep and Recurrent Networks with Hessian-Free Optimization
TLDR
In this chapter we first describe the basic HF approach, and then examine well-known performance-improving techniques, such as preconditioning, which we have found beneficial for neural network training, as well as more heuristic techniques that are harder to justify but work well in practice. A generic preconditioned-CG sketch appears below.
  • Citations: 179 · Influence: 17
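To make "preconditioning" concrete, here is a hedged sketch of preconditioned conjugate gradient with a diagonal (Jacobi) preconditioner, shown on a small explicit ill-conditioned system for clarity. In HF the curvature matrix is only available through matrix-vector products and the chapter builds its preconditioners from other quantities, so treat this as a generic illustration rather than the chapter's exact recipe.

```python
# Preconditioned CG with a Jacobi (diagonal) preconditioner.
import jax.numpy as jnp

A = jnp.diag(jnp.array([100.0, 10.0, 1.0, 0.1])) + 0.05 * jnp.ones((4, 4))
b = jnp.array([1.0, 1.0, 1.0, 1.0])
M_inv = 1.0 / jnp.diag(A)              # elementwise inverse of the diagonal

x = jnp.zeros(4)
r = b - A @ x
z = M_inv * r
p = z
rz = jnp.dot(r, z)
for _ in range(20):
    Ap = A @ p
    alpha = rz / jnp.dot(p, Ap)
    x = x + alpha * p
    r = r - alpha * Ap
    z = M_inv * r
    rz_new = jnp.dot(r, z)
    p = z + (rz_new / rz) * p
    rz = rz_new
print(jnp.linalg.norm(A @ x - b))      # residual should be near zero
```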
Adversarial Robustness through Local Linearization
TLDR
We introduce a novel regularizer that encourages the loss to behave linearly in the vicinity of the training data, thereby penalizing gradient obfuscation while encouraging robustness. A sketch of the local-linearity penalty appears below.
  • Citations: 81 · Influence: 12
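Here is a hedged sketch of a local-linearity penalty: measure how far the loss at a perturbed input deviates from its first-order Taylor expansion around the clean input, and add that deviation to the training objective. The toy model, the single random perturbation (the paper searches for a worst-case one), and the weighting are illustrative assumptions.

```python
# Penalize deviation from the first-order Taylor expansion of the loss
# in input space.
import jax
import jax.numpy as jnp

def model_loss(w, x):
    return jnp.tanh(jnp.dot(w, x)) ** 2          # stand-in per-example loss

def local_linearity_penalty(w, x, delta):
    loss_clean = model_loss(w, x)
    grad_x = jax.grad(model_loss, argnums=1)(w, x)
    taylor = loss_clean + jnp.dot(delta, grad_x) # first-order prediction
    return jnp.abs(model_loss(w, x + delta) - taylor)

def regularized_loss(w, x, delta, lam=4.0):
    return model_loss(w, x) + lam * local_linearity_penalty(w, x, delta)

key = jax.random.PRNGKey(0)
w = jnp.array([0.5, -1.0, 2.0])
x = jnp.array([1.0, 0.2, -0.3])
delta = 0.1 * jax.random.uniform(key, x.shape, minval=-1.0, maxval=1.0)
grad_w = jax.grad(regularized_loss)(w, x, delta)  # gradient used for the update
```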