Corpus ID: 235422700

NG+ : A Multi-Step Matrix-Product Natural Gradient Method for Deep Learning

Authors: Minghan Yang, Dong Xu, Qiwen Cui, Zaiwen Wen, Pengxiang Xu
In this paper, a novel second-order method called NG+ is proposed. Following the rule "the shape of the gradient equals the shape of the parameter", we define a generalized Fisher information matrix (GFIM) using products of gradients in matrix form rather than the traditional vectorization. Our generalized natural gradient direction is then simply the inverse of the GFIM multiplied by the gradient in matrix form. Moreover, the GFIM and its inverse remain the same for multiple steps…
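A minimal sketch of the idea described in the abstract, assuming the GFIM is approximated as an average of matrix-form gradient products G Gᵀ with damping (the exact GFIM definition, damping, and refresh schedule here are illustrative assumptions, not the paper's exact algorithm):

```python
import numpy as np

def ngplus_direction(G, gfim_inv):
    """Matrix-form natural gradient step: precondition the gradient
    with a cached inverse GFIM; the result keeps the gradient's shape."""
    return gfim_inv @ G  # same shape as the parameter matrix

def update_gfim_inverse(grads, damping=1e-3):
    """Rebuild the GFIM inverse from recent matrix-form gradients.
    The GFIM is taken here as the damped average of G @ G.T, an
    illustrative choice; see the paper for the precise definition."""
    m = grads[0].shape[0]
    gfim = sum(G @ G.T for G in grads) / len(grads)
    return np.linalg.inv(gfim + damping * np.eye(m))

# Toy loop: refresh the inverse only every T steps and reuse it in
# between, mirroring the "same GFIM for multiple steps" property.
rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))      # parameter matrix
T, lr = 5, 0.1                       # refresh interval, step size
gfim_inv = np.eye(4)
history = []
for step in range(20):
    G = rng.standard_normal(W.shape)  # stand-in for a stochastic gradient
    history.append(G)
    if step % T == 0:
        gfim_inv = update_gfim_inverse(history[-T:])
    W -= lr * ngplus_direction(G, gfim_inv)
```

Because the update is a plain matrix product with the gradient's own shape, no vectorization or Kronecker bookkeeping is needed, and the expensive inverse is amortized over T steps.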
1 Citation


Riemannian Natural Gradient Methods

A novel Riemannian natural gradient method is proposed, which can be viewed as a natural extension of the natural gradient method from the Euclidean setting to the manifold setting; it is proved that the Riemannian Jacobian stability condition is satisfied with high probability by a two-layer fully connected neural network with batch normalization, provided that the width of the network is sufficiently large.

Sketchy Empirical Natural Gradient Methods for Deep Learning

An efficient sketchy empirical natural gradient method is developed for large-scale finite-sum optimization problems arising in deep learning, and it is quite competitive with state-of-the-art methods such as SGD and KFAC.

Kronecker-factored Quasi-Newton Methods for Convolutional Neural Networks

KF-QNCNN is proposed, a new Kronecker-factored quasi-Newton method for training convolutional neural networks (CNNs), where the Hessian is approximated by a layer-wise block-diagonal matrix and each layer's diagonal block is further approximated by a Kronecker product corresponding to the structure of the Hessian restricted to that layer.

Practical Quasi-Newton Methods for Training Deep Neural Networks

This work proposes a new damping approach that keeps both the upper and lower bounds of the BFGS and L-BFGS approximations bounded; the resulting methods outperformed or performed comparably to KFAC and state-of-the-art first-order stochastic methods.

Optimizing Neural Networks with Kronecker-factored Approximate Curvature

K-FAC is an efficient method for approximating natural gradient descent in neural networks, based on an efficiently invertible approximation of a neural network's Fisher information matrix that is neither diagonal nor low-rank, and in some cases is completely non-sparse.
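A rough illustration of the Kronecker-factored idea for one fully connected layer (the damping value and second-moment estimates below are illustrative assumptions): the layer's Fisher block is approximated by the Kronecker product of the input second moment and the output-gradient second moment, so preconditioning reduces to two small matrix inverses applied on either side of the gradient.

```python
import numpy as np

def kfac_precondition(G, A, S, damping=1e-2):
    """Approximate natural-gradient step for one dense layer.
    G: weight gradient           (out_dim x in_dim)
    A: input second moment       (in_dim  x in_dim)
    S: output-grad second moment (out_dim x out_dim)
    The Fisher block ~ A (Kronecker) S, whose inverse applies as
    two small inverses instead of one huge dense inverse."""
    A_inv = np.linalg.inv(A + damping * np.eye(A.shape[0]))
    S_inv = np.linalg.inv(S + damping * np.eye(S.shape[0]))
    return S_inv @ G @ A_inv

rng = np.random.default_rng(1)
a = rng.standard_normal((100, 3))   # layer inputs (batch x in_dim)
g = rng.standard_normal((100, 2))   # backpropagated output gradients
A = a.T @ a / len(a)                # E[a a^T]
S = g.T @ g / len(g)                # E[g g^T]
G = g.T @ a / len(a)                # weight gradient (out x in)
step = kfac_precondition(G, A, S)
```

For a layer with n inputs and m outputs, this replaces inverting an (mn x mn) Fisher block with inverting one (n x n) and one (m x m) factor.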

Enhance Curvature Information by Structured Stochastic Quasi-Newton Methods

Numerical results on logistic regression, deep autoencoder networks and deep convolutional neural networks show that the proposed structured stochastic quasi-Newton method is quite competitive to the state-of-the-art methods.

Adam: A Method for Stochastic Optimization

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
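The update maintains exponential moving averages of the gradient and its elementwise square, with bias correction for the zero initialization. A minimal sketch on a toy quadratic (the step size and iteration count are illustrative choices):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update using bias-corrected moment estimates."""
    m = b1 * m + (1 - b1) * grad        # first moment (mean of gradients)
    v = b2 * v + (1 - b2) * grad**2     # second moment (uncentered variance)
    m_hat = m / (1 - b1**t)             # bias correction for zero init
    v_hat = v / (1 - b2**t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(x) = x^2 starting from x = 3; gradient is 2x.
x, m, v = np.array([3.0]), np.zeros(1), np.zeros(1)
for t in range(1, 2001):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.05)
```

The division by the square root of the second moment gives each coordinate its own effective step size, which is the "adaptive estimates of lower-order moments" the summary refers to.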

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

This work investigates the cause for this generalization drop in the large-batch regime and presents numerical evidence that supports the view that large- batch methods tend to converge to sharp minimizers of the training and testing functions - and as is well known, sharp minima lead to poorer generalization.

SchNet: A continuous-filter convolutional neural network for modeling quantum interactions

This work proposes to use continuous-filter convolutional layers to be able to model local correlations without requiring the data to lie on a grid, and obtains a joint model for the total energy and interatomic forces that follows fundamental quantum-chemical principles.

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and yields regret guarantees that are provably as good as those of the best proximal function chosen in hindsight.
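In its best-known diagonal form (AdaGrad), the adaptive proximal function amounts to scaling each coordinate's step by the accumulated squared gradient. A minimal sketch on a toy quadratic (step size and iteration count are illustrative choices):

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.1, eps=1e-8):
    """Diagonal AdaGrad: each coordinate's effective learning rate
    shrinks with its accumulated squared gradient, so frequently
    updated coordinates take smaller steps over time."""
    accum = accum + grad**2
    theta = theta - lr * grad / (np.sqrt(accum) + eps)
    return theta, accum

# Minimize f(x) = x_1^2 + x_2^2; gradient is 2x.
x, acc = np.array([2.0, -1.0]), np.zeros(2)
for _ in range(1000):
    x, acc = adagrad_step(x, 2 * x, acc)
```

The single scalar `lr` is all that needs tuning; the per-coordinate scaling adapts automatically, which is the "significantly simplifies setting a learning rate" claim above.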

Deep Residual Learning for Image Recognition

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
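The core of the framework is the residual block, y = x + F(x): the layers learn a residual F rather than the full mapping, and the identity shortcut lets gradients flow through very deep stacks. A minimal NumPy sketch of one block (the two-layer ReLU form of F and the weight shapes are illustrative assumptions):

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = x + F(x): the block learns a residual correction F,
    while the identity shortcut carries x through unchanged."""
    h = np.maximum(0, x @ W1)   # F's first layer with ReLU
    return x + h @ W2           # add the shortcut connection

rng = np.random.default_rng(2)
x = rng.standard_normal((5, 8))        # batch of 5, width 8
W1 = rng.standard_normal((8, 8)) * 0.1
W2 = rng.standard_normal((8, 8)) * 0.1
y = residual_block(x, W1, W2)
```

If F's weights are zero the block is exactly the identity, which is why adding such blocks cannot make the representable function worse and why depth becomes easier to exploit.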