EA-CG: An Approximate Second-Order Method for Training Fully-Connected Neural Networks

Sheng-Wei Chen, Chun-Nan Chou, Edward Y. Chang
For training fully-connected neural networks (FCNNs), we propose a practical approximate second-order method comprising 1) an approximation of the Hessian matrix and 2) a conjugate gradient (CG) based method. Our proposed approximate Hessian matrix is memory-efficient and can be applied to any FCNN whose activation and criterion functions are twice differentiable. We devise a CG-based method incorporating one-rank approximation to derive Newton directions for training FCNNs, which…
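As a rough illustration of the CG idea mentioned in the abstract, the sketch below solves a damped Newton system (H + λI) d = -g using only matrix-vector products. It is not the authors' EA-CG algorithm, whose details are truncated above; the 2x2 matrix `H`, the gradient `g`, the `damping` value, and all function names are hypothetical choices made for this example.

```python
def conjugate_gradient(matvec, b, max_iters=50, tol=1e-10):
    """Approximately solve A x = b for a symmetric positive-definite A,
    given only a function that computes v -> A v."""
    x = [0.0] * len(b)
    r = list(b)  # residual b - A x, which equals b because x starts at 0
    p = list(r)  # first search direction
    rs_old = sum(ri * ri for ri in r)
    for _ in range(max_iters):
        if rs_old < tol:  # squared residual norm is small enough
            break
        Ap = matvec(p)
        alpha = rs_old / sum(pi * api for pi, api in zip(p, Ap))
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = sum(ri * ri for ri in r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# Hypothetical curvature matrix and gradient (assumptions for illustration only).
H = [[4.0, 1.0], [1.0, 3.0]]
g = [1.0, 2.0]
damping = 0.01  # assumed damping term that keeps the system positive definite


def damped_matvec(v):
    """Compute (H + damping * I) v without forming any inverse."""
    return [sum(hij * vj for hij, vj in zip(row, v)) + damping * vi
            for row, vi in zip(H, v)]


# Newton-like direction d satisfying (H + damping * I) d = -g.
direction = conjugate_gradient(damped_matvec, [-gi for gi in g])
```

In a real network the explicit `H` would be replaced by a routine that returns curvature-vector products, so the full matrix never has to be stored.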
Tractable structured natural gradient descent using local parameterizations
This work generalizes the exponential natural evolutionary strategy, recovers existing Newton-like algorithms, yields new structured second-order algorithms, and gives new algorithms to learn covariances of Gaussian and Wishart-based distributions.
Deep Residual Partitioning
This work introduces residual partitioning, a novel second-order optimization method for training neural nets that converges to a competitive or better solution on several machine learning tasks.
Laplace Approximation for Uncertainty Estimation of Deep Neural Networks
The most popular deep neural network architectures are compared on how well they lend themselves to uncertainty estimation by Laplace approximation, assessing empirically the method's potential and deficiencies as well as its applicability to large models and datasets, while working towards an understanding of how architectural choices correlate with the quality of the obtained uncertainty estimates.


Block-diagonal Hessian-free Optimization for Training Neural Networks
Experiments on deep autoencoders, deep convolutional networks, and multilayer LSTMs demonstrate better convergence and generalization compared to the original Hessian-free approach and the Adam method.
Optimizing Neural Networks with Kronecker-factored Approximate Curvature
K-FAC is an efficient method for approximating natural gradient descent in neural networks, based on an efficiently invertible approximation of the network's Fisher information matrix that is neither diagonal nor low-rank and in some cases is completely non-sparse.
Practical Gauss-Newton Optimisation for Deep Learning
A side result of this work is that for piecewise linear transfer functions, the network objective function can have no differentiable local maxima, which may partially explain why such transfer functions facilitate effective optimisation.
A Kronecker-factored approximate Fisher matrix for convolution layers
Second-order optimization methods such as natural gradient descent have the potential to speed up training of neural networks by correcting for the curvature of the loss function. Unfortunately, the
Deep learning via Hessian-free optimization
A 2nd-order optimization method based on the "Hessian-free" approach is developed, and applied to training deep auto-encoders, and results superior to those reported by Hinton & Salakhutdinov (2006) are obtained.
Fast Exact Multiplication by the Hessian
This work derives a technique that directly calculates Hv, where v is an arbitrary vector, and shows that this technique can be used at the heart of many iterative techniques for computing various properties of H, obviating any need to calculate the full Hessian.
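A minimal sketch of the underlying idea, assuming a hand-written gradient for a toy objective: the product Hv equals the directional derivative of the gradient along v, which forward-mode differentiation with dual numbers evaluates exactly. The `Dual` class, `grad_f`, and `hessian_vector_product` are hypothetical names introduced here, not code from the paper.

```python
class Dual:
    """Number of the form real + eps * e, where e * e == 0."""

    def __init__(self, real, eps=0.0):
        self.real = real
        self.eps = eps

    @staticmethod
    def lift(x):
        """Wrap a plain number as a Dual with zero eps part."""
        return x if isinstance(x, Dual) else Dual(float(x))

    def __add__(self, other):
        other = Dual.lift(other)
        return Dual(self.real + other.real, self.eps + other.eps)

    __radd__ = __add__

    def __mul__(self, other):
        other = Dual.lift(other)
        # Product rule: the e * e term vanishes.
        return Dual(self.real * other.real,
                    self.real * other.eps + self.eps * other.real)

    __rmul__ = __mul__


def grad_f(w):
    """Hand-written gradient of the toy objective f(w) = w0^2 * w1 + w1^3."""
    w0, w1 = w
    return [2 * w0 * w1, w0 * w0 + 3 * w1 * w1]


def hessian_vector_product(grad, w, v):
    """Return H(w) v as the eps coefficient of grad(w + e * v)."""
    shifted = [Dual(wi, vi) for wi, vi in zip(w, v)]
    return [Dual.lift(gi).eps for gi in grad(shifted)]


# The Hessian at w = (1, 2) is [[4, 2], [2, 12]], so H v for v = (3, 4) is (20, 54).
hv = hessian_vector_product(grad_f, [1.0, 2.0], [3.0, 4.0])
```

The Hessian is never formed; one extra pass through the gradient computation yields the product, which is what makes iterative solvers such as CG practical at scale.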
Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent
We propose a generic method for iteratively approximating various second-order gradient steps (Newton, Gauss-Newton, Levenberg-Marquardt, and natural gradient) in linear time per iteration, using
Understanding the difficulty of training deep feedforward neural networks
The objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future.
Adaptive Method of Realizing Natural Gradient Learning for Multilayer Perceptrons
An adaptive method of directly obtaining the inverse of the Fisher information matrix is proposed and it generalizes the adaptive Gauss-Newton algorithms and provides a solid theoretical justification of them.