# On the Promise of the Stochastic Generalized Gauss-Newton Method for Training DNNs

```
@article{Gargiani2020OnTP,
  title   = {On the Promise of the Stochastic Generalized Gauss-Newton Method for Training DNNs},
  author  = {Matilde Gargiani and Andrea Zanelli and Moritz Diehl and Frank Hutter},
  journal = {ArXiv},
  year    = {2020},
  volume  = {abs/2006.02409}
}
```

Following early work on Hessian-free methods for deep learning, we study a stochastic generalized Gauss-Newton method (SGN) for training DNNs. SGN is a second-order optimization method, with efficient iterations, that we demonstrate to often require substantially fewer iterations than standard SGD to converge. As the name suggests, SGN uses a Gauss-Newton approximation for the Hessian matrix, and, in order to compute an approximate search direction, relies on the conjugate gradient method…
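The core loop the abstract describes — a Gauss-Newton approximation of the Hessian combined with conjugate gradient (CG) to obtain an approximate search direction — can be sketched on a plain least-squares objective. This is a minimal NumPy illustration under simplifying assumptions, not the paper's implementation; the name `sgn_step` and the `damping` parameter are hypothetical:

```python
import numpy as np

def sgn_step(J, r, damping=1e-3, cg_iters=50, tol=1e-10):
    """One Gauss-Newton step for a least-squares residual r with Jacobian J:
    solve (J^T J + damping * I) d = -J^T r by conjugate gradient,
    touching the curvature matrix only through matrix-vector products."""
    g = J.T @ r                          # gradient of 0.5 * ||r||^2

    def Gv(v):                           # damped Gauss-Newton matrix-vector product
        return J.T @ (J @ v) + damping * v

    d = np.zeros_like(g)                 # CG iteration for Gv(d) = -g
    res = -g - Gv(d)
    p = res.copy()
    rs_old = res @ res
    for _ in range(cg_iters):
        Gp = Gv(p)
        alpha = rs_old / (p @ Gp)
        d += alpha * p
        res -= alpha * Gp
        rs_new = res @ res
        if rs_new < tol:
            break
        p = res + (rs_new / rs_old) * p
        rs_old = rs_new
    return d

# Toy problem: r(w) = A w - b, so the Jacobian is A everywhere.
rng = np.random.default_rng(0)
A = rng.normal(size=(20, 5))
b = rng.normal(size=20)
w = np.zeros(5)
w_new = w + sgn_step(A, A @ w - b)
```

Note that CG only ever touches the curvature matrix through matrix-vector products `Gv(v)`; for DNNs the Jacobian is never formed explicitly and such products are instead computed with forward/backward automatic differentiation, which is what keeps the iterations efficient.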


#### 3 Citations

Bilevel stochastic methods for optimization and machine learning: Bilevel stochastic descent and DARTS

- Computer Science, Mathematics
- ArXiv
- 2021

Proposes BSG-1, a practical bilevel stochastic gradient method that requires neither lower-level second-order derivatives nor linear-system solves (and dispenses with matrix-vector products altogether); staying close to first-order principles allows it to outperform methods that are not, such as DARTS.

Flexible Modification of Gauss-Newton Method and Its Stochastic Extension

- Mathematics, Computer Science
- 2021

It is shown that the Gauss-Newton method in the stochastic setting can effectively find a solution under the weak growth condition (WGC) and the Polyak–Łojasiewicz (PL) condition, matching the convergence rate of the deterministic optimization method.

ViViT: Curvature access through the generalized Gauss-Newton's low-rank structure

- Computer Science, Mathematics
- ArXiv
- 2021

ViViT is presented, a curvature model that leverages the GGN's low-rank structure without further approximations, allowing for efficient computation of eigenvalues, eigenvectors, and per-sample first- and second-order directional derivatives, and offering a fine-grained cost-accuracy trade-off.
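The low-rank structure this snippet refers to admits a well-known computational shortcut, sketched below in NumPy. This is only an illustration of the general trick, not ViViT's actual implementation; the dimensions and the random stand-in for the stacked Jacobians are invented for the example:

```python
import numpy as np

# A mini-batch GGN has rank at most N * C (N samples, C model outputs), so it
# can be written G = V @ V.T with V of shape (D, K) and K << D. Its nonzero
# eigenvalues then equal those of the small K x K Gram matrix V.T @ V,
# so the full D x D matrix never needs to be formed.
rng = np.random.default_rng(1)
D, K = 500, 8                            # parameter dim >> low rank
V = rng.normal(size=(D, K))              # stand-in for stacked scaled Jacobians
evals_gram = np.sort(np.linalg.eigvalsh(V.T @ V))        # cheap: K x K
evals_full = np.sort(np.linalg.eigvalsh(V @ V.T))[-K:]   # expensive check: D x D
```

For realistic networks D is in the millions while K stays in the hundreds, which is why exploiting the Gram-matrix route makes exact curvature eigenvalues tractable.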

#### References

Showing 1–10 of 26 references

Numerical Optimization

- Computer Science
- J. Oper. Res. Soc.
- 2001

Deep learning via Hessian-free optimization

- Computer Science
- ICML
- 2010

A 2nd-order optimization method based on the “Hessian-free” approach is developed and applied to training deep auto-encoders, obtaining results superior to those reported by Hinton & Salakhutdinov (2006).

New Insights and Perspectives on the Natural Gradient Method

- Computer Science, Mathematics
- J. Mach. Learn. Res.
- 2020

This paper critically analyzes the natural gradient method and its properties, and shows how it can be viewed as a type of approximate 2nd-order optimization method, where the Fisher information matrix can be viewed as an approximation of the Hessian.
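The Fisher–Hessian connection mentioned in this snippet has a sharper form for exponential-family likelihoods with natural parameters (e.g. softmax cross-entropy). This is a standard identity in this literature rather than a claim lifted from the paper: there the Fisher information matrix coincides with the generalized Gauss-Newton matrix,

```latex
F(\theta)
  = \mathbb{E}_{x}\,\mathbb{E}_{y \sim p(y \mid f_\theta(x))}
    \!\left[\nabla_\theta \log p(y \mid f_\theta(x))\,
            \nabla_\theta \log p(y \mid f_\theta(x))^{\top}\right]
  = \mathbb{E}_{x}\!\left[J_f^{\top} H_L\, J_f\right],
\qquad
H_L = \nabla_f^2\bigl(-\log p(y \mid f)\bigr),
\quad
J_f = \frac{\partial f_\theta(x)}{\partial \theta}.
```

where $H_L$ is the Hessian of the loss with respect to the network output $f$, so natural-gradient and generalized Gauss-Newton methods use the same curvature matrix in this setting.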

Transferring Optimality Across Data Distributions via Homotopy Methods

- Computer Science
- ICLR
- 2020

This work proposes a novel homotopy-based numerical method that can transfer knowledge about the localization of an optimum across different task distributions in deep learning applications, and validates the proposed methodology with empirical evaluations.

PyTorch: An Imperative Style, High-Performance Deep Learning Library

- Computer Science, Mathematics
- NeurIPS
- 2019

This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.

Empirical Analysis of the Hessian of Over-Parametrized Neural Networks

- Computer Science, Mathematics
- ICLR
- 2018

Makes a case that links two observations: small- and large-batch gradient descent appear to converge to different basins of attraction, but these are in fact connected through a flat region and so belong to the same basin.

Automatic differentiation in machine learning: a survey

- Mathematics, Computer Science
- J. Mach. Learn. Res.
- 2017

By precisely defining the main differentiation techniques and their interrelationships, this work aims to bring clarity to the usage of the terms “autodiff”, “automatic differentiation”, and “symbolic differentiation” as these are encountered more and more in machine learning settings.

Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms

- Computer Science, Mathematics
- ArXiv
- 2017

Fashion-MNIST is intended to serve as a direct drop-in replacement for the original MNIST dataset for benchmarking machine learning algorithms, as it shares the same image size, data format, and structure of training and testing splits.

Model Predictive Control: Theory, Computation and Design

- Nob Hill, Madison, Wisconsin
- 2017