# An iterative K-FAC algorithm for Deep Learning

@article{Chen2021AnIK, title={An iterative K-FAC algorithm for Deep Learning}, author={Yingshi Chen}, journal={ArXiv}, year={2021}, volume={abs/2101.00218} }

The Kronecker-factored Approximate Curvature (K-FAC) method is a highly efficient second-order optimizer for deep learning. Its training time is lower than that of SGD (and other first-order methods) at the same accuracy on many large-scale problems. The key of K-FAC is to approximate the Fisher information matrix (FIM) as a block-diagonal matrix, where each block is inverted via its tiny Kronecker factors. In this short note, we present CG-FAC, a new iterative K-FAC algorithm. It uses the conjugate gradient method to…
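The core idea behind an iterative (rather than explicitly inverting) natural-gradient solver can be sketched with a plain conjugate-gradient loop driven by Fisher-vector products. This is a hedged illustration, not the paper's actual CG-FAC implementation: the Gauss-Newton-style Fisher F = JᵀJ of a toy least-squares model is a stand-in, and the matrix F is never materialized.

```python
import numpy as np

def fisher_vector_product(J, v):
    # For a least-squares model the (empirical) Fisher / Gauss-Newton
    # matrix is F = J^T J, so F @ v is computed as J^T (J v) without
    # ever forming the d x d matrix F.
    return J.T @ (J @ v)

def cg_solve(matvec, g, iters=50, tol=1e-10):
    # Standard conjugate gradient for F x = g, using only
    # matrix-vector products with the (implicit) SPD matrix F.
    x = np.zeros_like(g)
    r = g - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Fp = matvec(p)
        alpha = rs / (p @ Fp)
        x += alpha * p
        r -= alpha * Fp
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy problem: 20 samples, 5 parameters (sizes are illustrative).
rng = np.random.default_rng(0)
J = rng.standard_normal((20, 5))
grad = rng.standard_normal(5)

# Natural-gradient direction: solve F x = grad iteratively.
nat_grad = cg_solve(lambda v: fisher_vector_product(J, v), grad)
```

In exact arithmetic CG solves this 5-dimensional SPD system in at most 5 iterations, which is why matrix-free solvers are attractive when F is large but Fisher-vector products are cheap.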

## One Citation

The Brownian motion in the transformer model

- Computer Science (ArXiv)
- 2021

A deep analysis of its multi-head self-attention (MHSA) module is given, and it is found that each token is a random variable in a high-dimensional feature space; after layer normalization, these variables are mapped to points on the hyper-sphere.

## References

SHOWING 1-10 OF 24 REFERENCES

Optimizing Neural Networks with Kronecker-factored Approximate Curvature

- Computer Science, Mathematics (ICML)
- 2015

K-FAC is an efficient method for approximating natural gradient descent in neural networks, based on an efficiently invertible approximation of a neural network's Fisher information matrix that is neither diagonal nor low-rank and, in some cases, is completely non-sparse.
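The "efficiently invertible" property rests on the Kronecker identity (A ⊗ G)⁻¹ = A⁻¹ ⊗ G⁻¹: inverting two small factors is far cheaper than inverting their large product. A minimal NumPy check (the factor sizes are illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3)); A = A @ A.T + 3 * np.eye(3)  # small SPD factor
G = rng.standard_normal((4, 4)); G = G @ G.T + 3 * np.eye(4)  # small SPD factor

F_block = np.kron(A, G)                 # a 12x12 FIM block modeled as A (x) G
inv_direct = np.linalg.inv(F_block)     # invert the full 12x12 block
inv_kron = np.kron(np.linalg.inv(A), np.linalg.inv(G))  # invert 3x3 and 4x4 only

assert np.allclose(inv_direct, inv_kron)
```

For a layer with n inputs and m outputs, this replaces an O((nm)³) inverse with O(n³) + O(m³) work, which is the source of K-FAC's efficiency.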

Convolutional Neural Network Training with Distributed K-FAC

- Computer Science, Mathematics (SC)
- 2020

A scalable K-FAC design and its applicability to convolutional neural network (CNN) training at scale are investigated, and optimization techniques such as layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling are studied to reduce training time while preserving convergence.

A Kronecker-factored approximate Fisher matrix for convolution layers

- Mathematics, Computer Science (ICML)
- 2016

Second-order optimization methods such as natural gradient descent have the potential to speed up training of neural networks by correcting for the curvature of the loss function. Unfortunately, the…

On the Use of Stochastic Hessian Information in Optimization Methods for Machine Learning

- Mathematics, Computer Science (SIAM J. Optim.)
- 2011

Curvature information is incorporated in two subsampled Hessian algorithms, one based on a matrix-free inexact Newton iteration and one on a preconditioned limited memory BFGS iteration.
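The matrix-free inexact-Newton idea can be sketched on a least-squares toy problem. This is a hedged illustration under assumptions not in the original (problem sizes, subsample size, and the use of SciPy's CG solver): the Hessian is never formed, and only subsampled Hessian-vector products drive the inner CG solve.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(2)
n, d = 200, 6
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
w = np.zeros(d)

# Full gradient of the loss 0.5 * ||Xw - y||^2 / n.
grad = X.T @ (X @ w - y) / n

# Subsampled Hessian-vector product: H v ~ (1/|S|) X_S^T (X_S v),
# using 50 of the 200 samples; the d x d Hessian is never formed.
idx = rng.choice(n, size=50, replace=False)
def hvp(v):
    Xs = X[idx]
    return Xs.T @ (Xs @ v) / len(idx)

# Inexact Newton direction via a few CG iterations on the implicit Hessian.
H_op = LinearOperator((d, d), matvec=hvp)
step, _ = cg(H_op, grad, maxiter=20)
w_new = w - step
```

Because only matrix-vector products are needed, the subsample size and the CG iteration count become the knobs trading curvature accuracy against per-step cost.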

A Stochastic Quasi-Newton Method for Large-Scale Optimization

- Mathematics, Computer Science (SIAM J. Optim.)
- 2016

A stochastic quasi-Newton method that is efficient, robust, and scalable; it employs the classical BFGS update formula in its limited-memory form and builds on the observation that it is beneficial to collect curvature information pointwise, at regular intervals, through (sub-sampled) Hessian-vector products.

Adam: A Method for Stochastic Optimization

- Computer Science, Mathematics (ICLR)
- 2015

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
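The Adam update described here can be sketched in a few lines: exponential moving averages of the gradient and its square, bias correction, then a per-coordinate step. The hyperparameter defaults match the paper; the quadratic test function is only for illustration.

```python
import numpy as np

def adam_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    # One Adam update: exponential moving averages of the first and
    # second gradient moments, with bias correction for the zero init.
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)        # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)        # bias-corrected second moment
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Minimize f(w) = ||w||^2 / 2, whose gradient is simply w.
rng = np.random.default_rng(3)
w = rng.standard_normal(4)
m = np.zeros_like(w); v = np.zeros_like(w)
for t in range(1, 2001):
    w, m, v = adam_step(w, w.copy(), m, v, t, lr=0.05)
```

Note that because the step is normalized by the moment estimates, the iterates settle into a small neighborhood of the optimum whose size scales with the learning rate.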

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

- Computer Science, Mathematics (J. Mach. Learn. Res.)
- 2011

This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as those of the best proximal function that can be chosen in hindsight.
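The best-known instance of this adaptive-proximal idea is the diagonal AdaGrad update, sketched below; per-coordinate learning rates shrink with accumulated squared gradients. The quadratic example and its constants are illustrative assumptions, not from the paper.

```python
import numpy as np

def adagrad_step(w, g, G, lr=0.5, eps=1e-8):
    # AdaGrad: accumulate squared gradients per coordinate, so
    # frequently-updated coordinates get smaller effective steps.
    G = G + g * g
    w = w - lr * g / (np.sqrt(G) + eps)
    return w, G

# Minimize f(w) = ||w||^2 / 2, whose gradient is simply w.
rng = np.random.default_rng(4)
w = rng.standard_normal(3)
G = np.zeros_like(w)
for _ in range(500):
    w, G = adagrad_step(w, w.copy(), G)
```

The accumulator G never resets, so the effective step size decays roughly like 1/√t, which is what makes a single global learning rate easy to set.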

Optimization Methods for Large-Scale Machine Learning

- Computer Science, Mathematics (SIAM Rev.)
- 2018

A major theme of this study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter, leading to a discussion of the next generation of optimization methods for large-scale machine learning.

Natural Gradient Works Efficiently in Learning

- Computer Science, Mathematics (Neural Computation)
- 1998

The dynamical behavior of natural gradient online learning is analyzed and is proved to be Fisher efficient, implying that it has asymptotically the same performance as the optimal batch estimation of parameters.

New Insights and Perspectives on the Natural Gradient Method

- Computer Science, Mathematics (J. Mach. Learn. Res.)
- 2020

This paper critically analyzes the natural gradient method and its properties, and shows how it can be viewed as a type of approximate second-order optimization method in which the Fisher information matrix acts as an approximation of the Hessian.