Accelerating Hessian-free optimization for Deep Neural Networks by implicit preconditioning and sampling

@article{Sainath2013AcceleratingHO,
  title={Accelerating Hessian-free optimization for Deep Neural Networks by implicit preconditioning and sampling},
  author={Tara N. Sainath and L. Horesh and Brian Kingsbury and Aleksandr Y. Aravkin and Bhuvana Ramabhadran},
  journal={2013 IEEE Workshop on Automatic Speech Recognition and Understanding},
  year={2013},
  pages={303-308}
}
Hessian-free training has become a popular parallel second-order optimization technique for Deep Neural Network training. This study aims at speeding up Hessian-free training, both by decreasing the amount of data used for training and by reducing the number of Krylov subspace solver iterations used for implicit estimation of the Hessian. In this paper, we develop an L-BFGS based preconditioning scheme that avoids the need to access the Hessian explicitly. Since L-BFGS…
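
As context for the abstract, the sketch below spells out a standard Hessian-free inner problem and where the preconditioner enters. It is a generic formulation under common assumptions (damped Gauss-Newton curvature, CG/Krylov inner solver), not necessarily the exact variant used in this paper.

```latex
% Local quadratic model at the current weights \theta, with damped
% Gauss-Newton curvature G and damping \lambda (standard HF setup):
d^{*} \;=\; \arg\min_{d}\; q(d), \qquad
q(d) \;=\; \nabla L(\theta)^{\top} d \;+\; \tfrac{1}{2}\, d^{\top}\,(G + \lambda I)\, d .
% The Krylov solver only needs products (G + \lambda I)v, computed implicitly
% (no explicit Hessian); preconditioning with M \approx G + \lambda I reduces the
% number of such iterations by solving the equivalent system
M^{-1}\,(G + \lambda I)\, d \;=\; -\,M^{-1}\,\nabla L(\theta),
% where M^{-1} can be applied from stored L-BFGS curvature pairs (s_k, y_k)
% without ever forming G.
```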

Citations

Accelerating Hessian-free Gauss-Newton full-waveform inversion via improved preconditioning strategies

It is shown that the quasi-Newton l-BFGS preconditioning scheme with the pseudo-diagonal Gauss-Newton Hessian as the initial guess performs best in accelerating the HF Gauss-Newton FWI.

Training Neural Networks Without Gradients: A Scalable ADMM Approach

This paper explores an unconventional training method that uses alternating direction methods and Bregman iteration to train networks without gradient descent steps, and exhibits strong scaling in the distributed setting, yielding linear speedups even when split over thousands of cores.
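
For readers unfamiliar with the splitting behind such gradient-free training, here is the generic scaled-form ADMM template it builds on; the paper applies this kind of splitting layer-wise and combines it with Bregman iteration, so the updates below are only the textbook skeleton, not the paper's exact algorithm.

```latex
\min_{x,\,z}\; f(x) + g(z) \quad \text{subject to} \quad Ax + Bz = c
% Scaled-form ADMM with penalty parameter \rho > 0 and scaled dual variable u:
x^{k+1} = \arg\min_{x}\; f(x) + \tfrac{\rho}{2}\,\lVert Ax + Bz^{k} - c + u^{k}\rVert_2^{2}
z^{k+1} = \arg\min_{z}\; g(z) + \tfrac{\rho}{2}\,\lVert Ax^{k+1} + Bz - c + u^{k}\rVert_2^{2}
u^{k+1} = u^{k} + Ax^{k+1} + Bz^{k+1} - c
```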

Accelerating Hessian-free Gauss-Newton full-waveform inversion via l-BFGS preconditioned conjugate-gradient algorithm

This research develops and compares different preconditioning schemes for the CG algorithm to accelerate the HF Gauss-Newton (GN) method, and uses a new pseudo-diagonal GN Hessian as a preconditioner, making use of the reciprocal property of Green’s function.

Kronecker-Factored Curvature Approximations for Recurrent Neural Networks

This work extends the K-FAC method to handle RNNs by introducing a novel approximation to the Fisher information matrix (FIM) for RNNs, and demonstrates that this method significantly outperforms general-purpose state-of-the-art optimizers like SGD with momentum and Adam on several challenging RNN training tasks.
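
The computational payoff of a Kronecker-factored curvature approximation is that the approximate FIM never has to be formed or inverted directly. The small NumPy sketch below (made-up dimensions and random SPD factors) checks the Kronecker identity that K-FAC-style methods rely on; it illustrates the Kronecker algebra only, not the K-FAC algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def spd(n):
    """Random symmetric positive-definite matrix (stand-in for a covariance factor)."""
    m = rng.standard_normal((n, n))
    return m @ m.T + n * np.eye(n)

# Layer-wise K-FAC-style approximation F ≈ A ⊗ G, with A an input-covariance-like
# factor and G a gradient-covariance-like factor (both SPD here by construction).
A, G = spd(3), spd(4)
V = rng.standard_normal((4, 3))        # a gradient reshaped into matrix form
v = V.reshape(-1, order="F")           # column-stacked vec(V)

# Naive route: materialize the 12x12 Kronecker product and solve with it.
step_naive = np.linalg.solve(np.kron(A, G), v)

# Kronecker route: (A ⊗ G)^{-1} vec(V) = vec(G^{-1} V A^{-1}) -- only small solves.
step_kron = np.linalg.solve(G, V) @ np.linalg.inv(A)

assert np.allclose(step_naive, step_kron.reshape(-1, order="F"))
```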

Training Deep and Recurrent Networks with Hessian-Free Optimization

This chapter describes the basic HF approach and examines well-known performance-improving techniques such as preconditioning, which have been beneficial for neural network training, as well as others of a more heuristic nature which are harder to justify but have been found to work well in practice.

Parallel deep neural network training for LVCSR tasks using Blue Gene/Q

This work explores using the second-order Hessian-free (HF) algorithm for DNN training, for both cross-entropy and sequence training, on a Blue Gene/Q (BG/Q) system, which has thousands of processors and excellent interprocessor communication.

Bayesian Learning of LF-MMI Trained Time Delay Neural Networks for Speech Recognition

The efficacy of the proposed Bayesian techniques is further demonstrated in a comparison against the state-of-the-art performance obtained on the same task using the most recent hybrid and end-to-end systems reported in the literature.

Waveform Inversion for Estimating Subsurface Properties: Phase-encoding Strategies, Optimization Methods, Interparameter Tradeoffs Quantification and Reduction

This thesis concerns seismic full-waveform inversion (FWI) techniques for estimating subsurface properties. FWI approaches promise to provide high-resolution estimates of subsurface parameters using …

Image Classification Based on Convolutional Denoising Sparse Autoencoder

Experimental results on two common image datasets demonstrate that the proposed convolutional denoising sparse autoencoder approach is effective in image classification, and that none of its three components (local contrast normalization, SPP fused with center-prior, and vector normalization) can be excluded from the proposed approach.

References

Showing 1-10 of 19 references

Deep learning via Hessian-free optimization

A second-order optimization method based on the "Hessian-free" approach is developed and applied to training deep auto-encoders, obtaining results superior to those reported by Hinton & Salakhutdinov (2006).

Fast Exact Multiplication by the Hessian

This work derives a technique that directly calculates Hv, where v is an arbitrary vector, and shows that this technique can be used at the heart of many iterative techniques for computing various properties of H, obviating any need to calculate the full Hessian.
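
A compact way to realize this Hessian-vector product today is forward-over-reverse automatic differentiation. The hedged sketch below uses JAX (the toy function and names are mine, not from the paper) and checks the product against an explicit Hessian on a tiny example.

```python
import jax
import jax.numpy as jnp

def hvp(f, w, v):
    """Exact Hessian-vector product H v = d/dr [ grad f(w + r v) ] at r = 0,
    computed without ever forming H (forward-mode on top of reverse-mode)."""
    return jax.jvp(jax.grad(f), (w,), (v,))[1]

# Toy check against the explicit Hessian.
def f(w):
    return 0.5 * jnp.sum(w ** 2) + jnp.sum(jnp.sin(w))

w = jnp.array([0.3, -1.2, 2.0])
v = jnp.array([1.0, 0.0, -1.0])
assert jnp.allclose(hvp(f, w, v), jax.hessian(f)(w) @ v)
```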

Updating Quasi-Newton Matrices With Limited Storage

An update formula is given which generates matrices using information from the last m iterations, where m is any number supplied by the user; the BFGS method is considered to be the most efficient.
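
The limited-storage idea is usually implemented with the two-loop recursion, which applies the implicit inverse-Hessian approximation to a vector using only the last m curvature pairs. The sketch below is a plain NumPy rendering under the usual assumptions (each stored pair satisfies s^T y > 0), not code from the paper.

```python
import numpy as np

def lbfgs_apply(g, s_list, y_list):
    """Two-loop recursion: return H_m g, where H_m is the limited-memory inverse-Hessian
    approximation built from the last m pairs s_k = x_{k+1} - x_k, y_k = g_{k+1} - g_k
    (most recent pair last in the lists). Assumes s_k^T y_k > 0 for every stored pair."""
    q = g.copy()
    alphas, rhos = [], []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # newest to oldest
        rho = 1.0 / np.dot(y, s)
        alpha = rho * np.dot(s, q)
        q -= alpha * y
        alphas.append(alpha)
        rhos.append(rho)
    # Common initial scaling H_0 = gamma * I with gamma from the most recent pair.
    gamma = np.dot(s_list[-1], y_list[-1]) / np.dot(y_list[-1], y_list[-1])
    r = gamma * q
    for (s, y), alpha, rho in zip(zip(s_list, y_list),      # oldest to newest
                                  reversed(alphas), reversed(rhos)):
        beta = rho * np.dot(y, r)
        r += (alpha - beta) * s
    return r

# Secant sanity check: with a single stored pair (s, y), the recursion gives H y = s.
s = np.array([1.0, 2.0]); y = np.array([3.0, 1.0])
assert np.allclose(lbfgs_apply(y, [s], [y]), s)
```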

Automatic Preconditioning by Limited Memory Quasi-Newton Updating

A preconditioner for the conjugate gradient method that is designed for solving systems of equations $Ax = b_i$ with different right-hand-side vectors or for solving a sequence of slowly varying systems $A_k x = b_k$ is proposed.

Sample size selection in optimization methods for machine learning

A criterion is presented for increasing the sample size based on variance estimates obtained during the computation of a batch gradient, and an $O(1/\epsilon)$ complexity bound is established on the total cost of a gradient method.
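
A simplified version of such a variance-based ("norm") test can be stated in a few lines. The function below is my own hedged rendering of the idea (accept the current sample when the estimated gradient variance is small relative to the sample-gradient norm), not the paper's exact rule or constants.

```python
import numpy as np

def sample_large_enough(per_example_grads, theta=0.5):
    """Accept the current sample S if the estimated variance of the sample gradient
    is small relative to its norm:  ||Var_i(grad_i)||_1 / |S| <= theta^2 * ||g_S||^2.
    per_example_grads has shape (|S|, dim); theta controls how aggressively S grows."""
    S = per_example_grads.shape[0]
    g_S = per_example_grads.mean(axis=0)
    var = per_example_grads.var(axis=0, ddof=1)      # componentwise sample variance
    return var.sum() / S <= theta ** 2 * np.dot(g_S, g_S)

# If the test fails, one natural choice is to grow the sample until it would pass,
# e.g. |S_new| ≈ ||Var||_1 / (theta^2 * ||g_S||^2).
```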

Large Scale Distributed Deep Networks

This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.

Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent

We propose a generic method for iteratively approximating various second-order gradient steps (Newton, Gauss-Newton, Levenberg-Marquardt, and natural gradient) in linear time per iteration, using …
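
One concrete way to realize such linear-time curvature-vector products is a forward pass for Jv followed by a reverse pass for the transpose. The JAX sketch below (toy linear model and names of my choosing) computes a Gauss-Newton-vector product G v = J^T H_L J v and checks it against the closed form for that toy case.

```python
import jax
import jax.numpy as jnp

# Toy setup: linear "network" z = X w with squared-error loss, so the
# Gauss-Newton matrix is exactly X^T X (convenient for checking).
X = jnp.array([[1.0, 2.0], [0.5, -1.0], [3.0, 0.2]])
y = jnp.array([1.0, 0.0, 2.0])
net = lambda w: X @ w                           # model outputs z(w)
loss = lambda z: 0.5 * jnp.sum((z - y) ** 2)    # loss as a function of the outputs

def gauss_newton_vp(w, v):
    """G v = J^T H_L (J v): one forward-mode pass (J v), a small Hessian-vector
    product in output space (H_L (J v)), and one reverse-mode pass (J^T ...)."""
    z, Jv = jax.jvp(net, (w,), (v,))                    # J v
    HL_Jv = jax.jvp(jax.grad(loss), (z,), (Jv,))[1]     # H_L (J v)
    _, net_vjp = jax.vjp(net, w)
    return net_vjp(HL_Jv)[0]                            # J^T H_L J v

w = jnp.array([0.1, -0.4])
v = jnp.array([1.0, 2.0])
assert jnp.allclose(gauss_newton_vp(w, v), (X.T @ X) @ v)
```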

Hybrid Deterministic-Stochastic Methods for Data Fitting

Rate-of-convergence analysis shows that by controlling the sample size in an incremental gradient algorithm, it is possible to maintain the steady convergence rates of full-gradient methods.

An Introduction to the Conjugate Gradient Method Without the Agonizing Pain

The Conjugate Gradient Method is the most prominent iterative method for solving sparse systems of linear equations. Unfortunately, many textbook treatments of the topic are written so that even …

Efficient Implementation of a Class of Preconditioned Conjugate Gradient Methods

The preconditioned conjugate gradient (PCG) method is an effective means for solving systems of linear equations where the coefficient matrix is symmetric and positive definite. The incomplete $LDL^t$ …
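
To make the role of the preconditioner concrete, here is a minimal NumPy sketch of PCG where M^{-1} is passed in as a function. A simple Jacobi (diagonal) preconditioner stands in for the incomplete LDL^t factorization discussed in this reference; in the main paper an L-BFGS-based operator would play the same role.

```python
import numpy as np

def preconditioned_cg(A, b, apply_Minv, tol=1e-10, max_iter=None):
    """Preconditioned conjugate gradient for symmetric positive-definite A.
    apply_Minv(r) returns M^{-1} r; with apply_Minv = identity this is plain CG."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = apply_Minv(r)
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter or b.size):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = apply_Minv(r)
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x

# Jacobi (diagonal) preconditioning as a simple stand-in for incomplete LDL^t.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = preconditioned_cg(A, b, lambda r: r / np.diag(A))
assert np.allclose(A @ x, b)
```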