# Accelerating Hessian-free optimization for Deep Neural Networks by implicit preconditioning and sampling

@article{Sainath2013AcceleratingHO,
  title={Accelerating Hessian-free optimization for Deep Neural Networks by implicit preconditioning and sampling},
  author={Tara N. Sainath and L. Horesh and Brian Kingsbury and Aleksandr Y. Aravkin and Bhuvana Ramabhadran},
  journal={2013 IEEE Workshop on Automatic Speech Recognition and Understanding},
  year={2013},
  pages={303--308}
}
• Published 5 September 2013
• Computer Science
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims to speed up Hessian-free training, both by decreasing the amount of data used for training and by reducing the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS-based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS…
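The "implicit estimation of the Hessian" in the abstract refers to Krylov solvers that need only Hessian-vector products, never the matrix itself. A minimal sketch of matrix-free conjugate gradients, assuming the Hessian is symmetric positive definite and exposed only through a callable `hv` (the names here are illustrative, not from the paper):

```python
import numpy as np

def cg_matrix_free(hv, b, x0=None, tol=1e-10, max_iter=200):
    """Solve H x = b given only a Hessian-vector product callable hv(v)."""
    x = np.zeros_like(b) if x0 is None else x0.copy()
    r = b - hv(x)          # residual
    p = r.copy()           # search direction
    rs = r @ r
    for _ in range(max_iter):
        hp = hv(p)
        alpha = rs / (p @ hp)
        x = x + alpha * p
        r = r - alpha * hp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Demo on a small SPD quadratic f(x) = 0.5 x^T A x - b^T x, so H = A.
rng = np.random.default_rng(0)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)   # symmetric positive definite
b = rng.standard_normal(5)
x = cg_matrix_free(lambda v: A @ v, b)
```

In actual Hessian-free training, `hv` would be a Gauss-Newton-vector product computed by forward/backward passes through the network rather than an explicit matrix multiply.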

## Citations of this paper

• Computer Science
• 2016
It is shown that the quasi-Newton L-BFGS preconditioning scheme with the pseudo-diagonal Gauss-Newton Hessian as the initial guess gives the best performance in accelerating the HF Gauss-Newton FWI.
• Computer Science
ICML
• 2016
This paper explores an unconventional training method that uses alternating direction methods and Bregman iteration to train networks without gradient descent steps, and exhibits strong scaling in the distributed setting, yielding linear speedups even when split over thousands of cores.
• Computer Science, Geology
• 2017
This research develops and compares different preconditioning schemes for the CG algorithm to accelerate the HF Gauss-Newton (GN) method, and uses a new pseudo-diagonal GN Hessian as a preconditioner, making use of the reciprocal property of Green's function.
• Computer Science
• 2017
This work extends the K-FAC method to handle RNNs by introducing a novel approximation to the Fisher information matrix (FIM) for RNNs, and demonstrates that this method significantly outperforms general-purpose state-of-the-art optimizers like SGD with momentum and Adam on several challenging RNN training tasks.
• Computer Science
Neural Networks: Tricks of the Trade
• 2012
This chapter describes the basic HF approach, and examines well-known performance-improving techniques such as preconditioning, which have been beneficial for neural network training, as well as others of a more heuristic nature that are harder to justify but have been found to work well in practice.
• Computer Science
INTERSPEECH
• 2014
This work explores using the 2nd-order Hessian-free (HF) algorithm for both cross-entropy and sequence training of DNNs on a Blue Gene/Q system, which has thousands of processors and excellent interprocessor communication.
• Computer Science
IEEE/ACM Transactions on Audio, Speech, and Language Processing
• 2021
The efficacy of the proposed Bayesian techniques is further demonstrated in a comparison against the state-of-the-art performance obtained on the same task using the most recent hybrid and end-to-end systems reported in the literature.
This thesis concerns seismic full-waveform inversion (FWI) techniques for estimating subsurface properties. FWI approaches promise to provide high-resolution estimates of subsurface parameters using
• Computer Science
• 2017
Experimental results on two common image datasets demonstrate that the proposed convolutional denoising sparse autoencoder approach is effective in image classification, and that none of its three components (local contrast normalization, SPP fused with center-prior, and vector normalization) can be excluded from the proposed approach.

## References

Showing 1–10 of 19 references

A 2nd-order optimization method based on the "Hessian-free" approach is developed, and applied to training deep auto-encoders, and results superior to those reported by Hinton & Salakhutdinov (2006) are obtained.
This work derives a technique that directly calculates Hv, where v is an arbitrary vector, and shows that this technique can be used at the heart of many iterative techniques for computing various properties of H, obviating any need to calculate the full Hessian.
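The reference above (Pearlmutter's R-operator) computes Hv exactly in one extra pass of automatic differentiation. As a simpler stand-in for illustration, Hv can also be approximated by central differences of the gradient; this is not Pearlmutter's exact technique, but it conveys why no full Hessian is ever needed:

```python
import numpy as np

def hv_fd(grad, x, v, eps=1e-5):
    """Approximate the Hessian-vector product H(x) v by central differences
    of the gradient: Hv ~ (grad(x + eps*v) - grad(x - eps*v)) / (2*eps)."""
    return (grad(x + eps * v) - grad(x - eps * v)) / (2 * eps)

# Sanity check on a quadratic f(x) = 0.5 x^T A x, whose gradient is A x
# and whose Hessian is A, so the difference quotient is exact here.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
A = 0.5 * (A + A.T)            # symmetrize
grad = lambda x: A @ x
x = rng.standard_normal(4)
v = rng.standard_normal(4)
approx = hv_fd(grad, x, v)
```

The cost is two gradient evaluations per product, independent of the dimension, which is what makes Krylov methods on top of such products practical for large networks.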
An update formula is presented which generates matrices using information from the last m iterations, where m is any number supplied by the user; the BFGS method is considered to be the most efficient.
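The "information from the last m iterations" consists of curvature pairs (s_k, y_k) = (x_{k+1} - x_k, grad_{k+1} - grad_k); the standard two-loop recursion applies the resulting inverse-Hessian approximation to a vector without forming any matrix. A sketch assuming the pairs are supplied and each y_k @ s_k > 0:

```python
import numpy as np

def lbfgs_two_loop(q, s_list, y_list, gamma=1.0):
    """Apply the L-BFGS inverse-Hessian approximation (initial guess
    H0 = gamma * I) to the vector q via the two-loop recursion."""
    q = np.array(q, dtype=float)
    rhos, alphas = [], []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest pair first
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q = q - alpha * y
        rhos.append(rho)
        alphas.append(alpha)
    r = gamma * q
    for s, y, rho, alpha in zip(s_list, y_list,
                                reversed(rhos), reversed(alphas)):
        beta = rho * (y @ r)
        r = r + (alpha - beta) * s
    return r

# Demo: curvature pairs from a quadratic with SPD Hessian A, so y = A s.
rng = np.random.default_rng(2)
M = rng.standard_normal((5, 5))
A = M @ M.T + 5 * np.eye(5)
s_list = [rng.standard_normal(5) for _ in range(3)]
y_list = [A @ s for s in s_list]
r = lbfgs_two_loop(y_list[-1], s_list, y_list)
```

A useful property for testing: the BFGS inverse update enforces the secant condition exactly for the most recent pair, so applying the recursion to `y_list[-1]` returns `s_list[-1]` regardless of `gamma`.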
• Computer Science
SIAM J. Optim.
• 2000
A preconditioner for the conjugate gradient method is proposed that is designed for solving systems of equations A x = b_i with different right-hand-side vectors, or for solving a sequence of slowly varying systems A_k x = b_k.
• Computer Science
Math. Program.
• 2012
A criterion for increasing the sample size based on variance estimates obtained during the computation of a batch gradient is presented, and an O(1/ε) complexity bound on the total cost of a gradient method is established.
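One common form of such a variance test (in the spirit of the criterion above; the exact thresholding rule here is an assumption for illustration) grows the sample only when the estimated variance of the sample gradient is large relative to the gradient norm:

```python
import numpy as np

def next_sample_size(per_example_grads, theta=0.5):
    """Variance-based sample-size test: keep the current sample size while
    the estimated variance of the mean gradient is small relative to the
    squared gradient norm, otherwise return a larger suggested size.
    per_example_grads is an (n, d) array of individual gradients."""
    n = per_example_grads.shape[0]
    g = per_example_grads.mean(axis=0)              # sample gradient
    var = per_example_grads.var(axis=0, ddof=1)     # componentwise variance
    lhs = var.sum() / n                             # variance of the mean
    rhs = theta**2 * (g @ g)
    if lhs <= rhs:
        return n                                    # current sample suffices
    return int(np.ceil(var.sum() / (theta**2 * (g @ g))))

# Demo: nearly identical gradients pass the test; very noisy ones do not.
rng = np.random.default_rng(3)
base = np.ones(10)
grads_low = base + 1e-4 * rng.standard_normal((64, 10))
grads_high = base + 50.0 * rng.standard_normal((64, 10))
```

This is the "sampling" half of the paper's speedup: early in training, small noisy batches suffice, and the batch grows only when the variance test demands it.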
• Computer Science
NIPS
• 2012
This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
We propose a generic method for iteratively approximating various second-order gradient steps-Newton, Gauss-Newton, Levenberg-Marquardt, and natural gradient-in linear time per iteration, using
• Computer Science
SIAM J. Sci. Comput.
• 2012
Rate-of-convergence analysis shows that by controlling the sample size in an incremental gradient algorithm, it is possible to maintain the steady convergence rates of full-gradient methods.
The Conjugate Gradient Method is the most prominent iterative method for solving sparse systems of linear equations. Unfortunately, many textbook treatments of the topic are written so that even
The preconditioned conjugate gradient (PCG) method is an effective means for solving systems of linear equations where the coefficient matrix is symmetric and positive definite. The incomplete LDL^t