Corpus ID: 219260454

On the Promise of the Stochastic Generalized Gauss-Newton Method for Training DNNs

@article{Gargiani2020OnTP,
  title={On the Promise of the Stochastic Generalized Gauss-Newton Method for Training DNNs},
  author={Matilde Gargiani and Andrea Zanelli and Moritz Diehl and Frank Hutter},
  journal={ArXiv},
  year={2020},
  volume={abs/2006.02409}
}
Following early work on Hessian-free methods for deep learning, we study a stochastic generalized Gauss-Newton method (SGN) for training DNNs. SGN is a second-order optimization method, with efficient iterations, that we demonstrate to often require substantially fewer iterations than standard SGD to converge. As the name suggests, SGN uses a Gauss-Newton approximation for the Hessian matrix, and, in order to compute an approximate search direction, relies on the conjugate gradient method…
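As a concrete illustration of the step described in the abstract, the following is a minimal sketch, not the authors' implementation: it shows how a generalized Gauss-Newton (GGN) matrix-vector product can be assembled from one forward-mode and one reverse-mode sweep and handed to a conjugate-gradient solver. The toy network, squared-error loss, damping value, learning rate, and CG budget are all illustrative assumptions, and JAX is used here only for brevity.

```python
import jax
import jax.numpy as jnp
from jax.scipy.sparse.linalg import cg

def net(params, x):
    """Toy two-layer MLP standing in for the DNN (an illustrative assumption)."""
    h = jnp.tanh(x @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

def loss_from_outputs(z, y):
    """Loss as a function of the network outputs only, so its Hessian H_L is cheap."""
    return 0.5 * jnp.mean((z - y) ** 2)

def sgn_step(params, x, y, damping=1e-3, lr=1.0, cg_iters=10):
    # Network outputs z and a closure computing vector-Jacobian products J^T(.)
    z, vjp_fn = jax.vjp(lambda p: net(p, x), params)
    grads = vjp_fn(jax.grad(loss_from_outputs)(z, y))[0]  # full gradient J^T dL/dz

    def ggn_matvec(v):
        # (G + damping*I) v with G = J^T H_L J, assembled from one JVP,
        # one output-space Hessian-vector product, and one VJP.
        _, Jv = jax.jvp(lambda p: net(p, x), (params,), (v,))
        _, HJv = jax.jvp(lambda zz: jax.grad(loss_from_outputs)(zz, y), (z,), (Jv,))
        JtHJv = vjp_fn(HJv)[0]
        return jax.tree_util.tree_map(lambda a, b: a + damping * b, JtHJv, v)

    # Approximate search direction from a few conjugate-gradient iterations.
    direction, _ = cg(ggn_matvec, grads, maxiter=cg_iters)
    return jax.tree_util.tree_map(lambda p, d: p - lr * d, params, direction)

# Tiny usage example on random data; shapes and values are arbitrary.
key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
params = {"W1": 0.1 * jax.random.normal(k1, (4, 8)), "b1": jnp.zeros(8),
          "W2": 0.1 * jax.random.normal(k2, (8, 1)), "b2": jnp.zeros(1)}
x, y = jax.random.normal(k3, (16, 4)), jnp.ones((16, 1))
params = sgn_step(params, x, y)
```

The point of this construction is that the GGN matrix is never materialized: each CG iteration needs only one matrix-vector product, which costs roughly one extra forward-mode and one extra reverse-mode pass through the network, so a small CG budget keeps the per-iteration cost within a modest constant factor of a standard gradient step.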
Bilevel stochastic methods for optimization and machine learning: Bilevel stochastic descent and DARTS
TLDR: A practical bilevel stochastic gradient method (BSG-1) is proposed that requires neither lower-level second-order derivatives nor system solves (and avoids matrix-vector products altogether); staying close to first-order principles allows it to outperform methods that do not, such as DARTS.
Flexible Modification of Gauss-Newton Method and Its Stochastic Extension
TLDR: It is shown that the Gauss-Newton method in the stochastic setting can effectively find a solution under the weak growth condition (WGC) and the Polyak–Łojasiewicz (PL) condition, matching the convergence rate of the deterministic optimization method.
ViViT: Curvature access through the generalized Gauss-Newton's low-rank structure
TLDR: ViViT is presented, a curvature model that leverages the GGN's low-rank structure without further approximations, allowing efficient computation of eigenvalues, eigenvectors, and per-sample first- and second-order directional derivatives, and offering a fine-grained cost-accuracy trade-off.

References

Showing 1–10 of 26 references
Numerical Optimization
  • D. Smith
  • Computer Science
  • J. Oper. Res. Soc.
  • 2001
Deep Learning
Deep learning via Hessian-free optimization
TLDR: A second-order optimization method based on the "Hessian-free" approach is developed and applied to training deep auto-encoders, obtaining results superior to those reported by Hinton & Salakhutdinov (2006).
New Insights and Perspectives on the Natural Gradient Method
  • James Martens
  • Computer Science, Mathematics
  • J. Mach. Learn. Res.
  • 2020
TLDR: This paper critically analyzes the natural gradient method and its properties, and shows how it can be viewed as a type of approximate second-order optimization method, in which the Fisher information matrix acts as an approximation of the Hessian.
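For context, a standard identity (not part of the summary above) makes that link concrete: when the training loss is the negative log-likelihood of an exponential-family predictive distribution whose natural parameters are the network outputs \(z = f_\theta(x)\), the Fisher information matrix coincides with the generalized Gauss-Newton matrix used by SGN,
\[
F \;=\; \mathbb{E}_{x}\!\left[\, J_\theta(x)^\top \, H_L\!\big(f_\theta(x)\big)\, J_\theta(x) \,\right] \;=\; G,
\]
where \(J_\theta\) is the Jacobian of the network outputs with respect to the parameters and \(H_L\) is the Hessian of the loss with respect to the outputs; both matrices are positive semidefinite approximations of the Hessian of the training objective.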
Transferring Optimality Across Data Distributions via Homotopy Methods
TLDR: A novel homotopy-based numerical method is proposed that can transfer knowledge about the localization of an optimum across different task distributions in deep learning applications; the methodology is validated with empirical evaluations.
PyTorch: An Imperative Style, High-Performance Deep Learning Library
TLDR: This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.
Empirical Analysis of the Hessian of Over-Parametrized Neural Networks
TLDR: A case is made that links two observations: small-batch and large-batch gradient descent appear to converge to different basins of attraction but are in fact connected through a flat region and so belong to the same basin.
Automatic differentiation in machine learning: a survey
TLDR: By precisely defining the main differentiation techniques and their interrelationships, this work aims to bring clarity to the usage of the terms "autodiff", "automatic differentiation", and "symbolic differentiation" as these are encountered more and more in machine learning settings.
Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms
TLDR: Fashion-MNIST is intended to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms, as it shares the same image size, data format, and structure of training and testing splits.
Model Predictive Control: Theory, Computation and Design
  • Nob Hill, Madison, Wisconsin
  • 2017