• Corpus ID: 230437577

An iterative K-FAC algorithm for Deep Learning

@article{Chen2021AnIK,
  title={An iterative K-FAC algorithm for Deep Learning},
  author={Yingshi Chen},
  journal={ArXiv},
  year={2021},
  volume={abs/2101.00218}
}
  • Yingshi Chen
  • Published 1 January 2021
  • Computer Science, Mathematics
  • ArXiv
Kronecker-factored Approximate Curvature (K-FAC) is a highly efficient second-order optimizer for deep learning. On many large-scale problems its training time is lower than that of SGD (or other first-order methods) at the same accuracy. The key of K-FAC is to approximate the Fisher information matrix (FIM) as a block-diagonal matrix, where each block is the Kronecker product of tiny factors, so its inverse can be obtained from the inverses of those factors. In this short note, we present CG-FAC, a new iterative K-FAC algorithm. It uses the conjugate gradient method to… 
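The abstract is cut off above, so the following is only a minimal sketch, under standard K-FAC assumptions, of the idea it points to: computing a layer's natural gradient by running conjugate gradient against the Kronecker-factored Fisher block instead of explicitly inverting the factors. The function names (kfac_block_matvec, cg_natural_gradient) and the damping scheme are illustrative assumptions, not the author's API.

    # Minimal sketch (not the author's code): matrix-free CG solve for one layer's
    # natural gradient, assuming the layer's Fisher block is approximated as
    # A (x) G, where A = E[a a^T] is the input-activation second moment and
    # G = E[g g^T] is the pre-activation gradient second moment, plus damping lam*I.
    import numpy as np

    def kfac_block_matvec(V, A, G, lam):
        # (A (x) G + lam*I) applied to a weight-shaped matrix V is G @ V @ A + lam*V.
        return G @ V @ A + lam * V

    def cg_natural_gradient(grad_W, A, G, lam=1e-3, iters=20, tol=1e-10):
        """Solve (A (x) G + lam*I) X = grad_W by conjugate gradient, never inverting A or G."""
        X = np.zeros_like(grad_W)
        R = grad_W - kfac_block_matvec(X, A, G, lam)   # residual
        P = R.copy()
        rs_old = np.sum(R * R)
        for _ in range(iters):
            FP = kfac_block_matvec(P, A, G, lam)
            alpha = rs_old / np.sum(P * FP)
            X += alpha * P
            R -= alpha * FP
            rs_new = np.sum(R * R)
            if rs_new < tol:
                break
            P = R + (rs_new / rs_old) * P
            rs_old = rs_new
        return X   # approximate natural gradient for this layer

    # Usage: X = cg_natural_gradient(grad_W, A, G); W -= lr * X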
1 Citation
The Brownian motion in the transformer model
TLDR
A deep analysis of the transformer's multi-head self-attention (MHSA) module is given; each token is found to be a random variable in a high-dimensional feature space, and after layer normalization these variables are mapped to points on a hypersphere.
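As a small numerical aside (not taken from the cited paper), the hypersphere observation can be checked directly: plain layer normalization without an affine transform maps every token vector onto a sphere of radius sqrt(d).

    # Numerical check of the hypersphere claim for layer normalization (no scale/shift).
    import numpy as np

    def layer_norm(x, eps=1e-5):
        mu = x.mean(axis=-1, keepdims=True)
        var = x.var(axis=-1, keepdims=True)
        return (x - mu) / np.sqrt(var + eps)

    d = 512
    tokens = np.random.randn(8, d) * np.random.rand(8, 1) * 10   # tokens at varied scales
    norms = np.linalg.norm(layer_norm(tokens), axis=-1)
    print(norms, np.sqrt(d))   # every norm is close to sqrt(512) ~ 22.63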

References

SHOWING 1-10 OF 24 REFERENCES
Optimizing Neural Networks with Kronecker-factored Approximate Curvature
TLDR
K-FAC is an efficient method for approximating natural gradient descent in neural networks; it is based on an efficiently invertible approximation of the network's Fisher information matrix that is neither diagonal nor low-rank and, in some cases, is completely non-sparse.
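For context, a minimal sketch of why this Kronecker approximation is cheap to invert (standard K-FAC algebra, not code from the paper): for a fully connected layer, applying the inverse of the block A ⊗ G to the gradient reduces to two small linear solves. The function name kfac_precondition and the damping value are illustrative assumptions.

    # Applying (A (x) G)^{-1} to the layer gradient amounts to G^{-1} dW A^{-1}.
    import numpy as np

    def kfac_precondition(grad_W, A, G, lam=1e-3):
        # Damped factors keep the two small solves well conditioned.
        A_d = A + lam * np.eye(A.shape[0])
        G_d = G + lam * np.eye(G.shape[0])
        return np.linalg.solve(G_d, np.linalg.solve(A_d, grad_W.T).T)

    # Shapes: grad_W is (out_dim, in_dim), A is (in_dim, in_dim), G is (out_dim, out_dim).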
Convolutional Neural Network Training with Distributed K-FAC
TLDR
A scalable K-FAC design and its applicability to convolutional neural network (CNN) training at scale are investigated; optimization techniques such as layer-wise distribution strategies, inverse-free second-order gradient evaluation, and dynamic K-FAC update decoupling are studied to reduce training time while preserving convergence.
A Kronecker-factored approximate Fisher matrix for convolution layers
Second-order optimization methods such as natural gradient descent have the potential to speed up training of neural networks by correcting for the curvature of the loss function. Unfortunately, the…
On the Use of Stochastic Hessian Information in Optimization Methods for Machine Learning
TLDR
Curvature information is incorporated in two subsampled Hessian algorithms, one based on a matrix-free inexact Newton iteration and one on a preconditioned limited memory BFGS iteration.
A Stochastic Quasi-Newton Method for Large-Scale Optimization
TLDR
A stochastic quasi-Newton method that is efficient, robust and scalable, and employs the classical BFGS update formula in its limited memory form, based on the observation that it is beneficial to collect curvature information pointwise, and at regular intervals, through (sub-sampled) Hessian-vector products.
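A hedged sketch of the key primitive behind such methods, a sub-sampled Hessian-vector product computed by double backpropagation, is given below; this is a generic PyTorch illustration, not the cited authors' implementation.

    # Hessian-vector product H v without ever forming the Hessian.
    import torch

    def hessian_vector_product(loss, params, vec):
        """H v for the Hessian of `loss` w.r.t. `params`; `vec` is a list of like-shaped tensors."""
        grads = torch.autograd.grad(loss, params, create_graph=True)
        dot = sum((g * v).sum() for g, v in zip(grads, vec))
        return torch.autograd.grad(dot, params)

    # Tiny check on a quadratic: loss = 0.5 * x^T x has Hessian I, so H v = v.
    x = torch.randn(3, requires_grad=True)
    loss = 0.5 * (x * x).sum()
    (hv,) = hessian_vector_product(loss, [x], [torch.ones(3)])
    print(hv)   # tensor([1., 1., 1.])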
Adam: A Method for Stochastic Optimization
TLDR
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
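For reference, a minimal sketch of the Adam update rule described above (bias-corrected estimates of the first and second moments of the gradient); the hyperparameter values are the conventional defaults, used here only as illustrative assumptions.

    import numpy as np

    def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        m = beta1 * m + (1 - beta1) * grad            # first-moment (mean) estimate
        v = beta2 * v + (1 - beta2) * grad * grad     # second-moment (uncentered variance) estimate
        m_hat = m / (1 - beta1 ** t)                  # bias correction, t is the step count (>= 1)
        v_hat = v / (1 - beta2 ** t)
        param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
        return param, m, v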
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
TLDR
This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as the best proximal function that can be chosen in hindsight.
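A minimal sketch of the diagonal AdaGrad variant this summary alludes to: each coordinate's effective learning rate is scaled down by the accumulated squared gradients seen so far. The function name and step size are illustrative assumptions.

    import numpy as np

    def adagrad_step(param, grad, accum, lr=0.1, eps=1e-8):
        accum = accum + grad * grad                   # running sum of squared gradients
        param = param - lr * grad / (np.sqrt(accum) + eps)
        return param, accum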
Optimization Methods for Large-Scale Machine Learning
TLDR
A major theme of this study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter, leading to a discussion about the next generation of optimization methods for large-scale machine learning.
Natural Gradient Works Efficiently in Learning
  • S. Amari
  • Computer Science, Mathematics
    Neural Computation
  • 1998
TLDR
The dynamical behavior of natural gradient online learning is analyzed and proved to be Fisher efficient, implying that it has asymptotically the same performance as the optimal batch estimation of parameters.
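For reference, a minimal sketch of the natural-gradient update analyzed in this paper: the ordinary gradient is preconditioned by the inverse Fisher information matrix, so the step is taken in the geometry of the model's distribution rather than in raw parameter space. The damping term is an illustrative assumption.

    import numpy as np

    def natural_gradient_step(theta, grad, fisher, lr=0.1, lam=1e-4):
        # Damping keeps the (estimated) Fisher matrix invertible.
        F = fisher + lam * np.eye(fisher.shape[0])
        return theta - lr * np.linalg.solve(F, grad)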
New Insights and Perspectives on the Natural Gradient Method
  • James Martens
  • Computer Science, Mathematics
    J. Mach. Learn. Res.
  • 2020
TLDR
This paper critically analyzes the natural gradient method and its properties, and shows how it can be viewed as a type of approximate second-order optimization method, in which the Fisher information matrix can be viewed as an approximation of the Hessian.
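The relationship alluded to here rests on a standard identity (stated for reference, not quoted from the paper): when the expectation is taken under the model's own predictive distribution,

    F(\theta) = \mathbb{E}_{y \sim p_\theta(\cdot\mid x)}\big[\nabla_\theta \log p_\theta(y\mid x)\,\nabla_\theta \log p_\theta(y\mid x)^{\top}\big] = \mathbb{E}_{y \sim p_\theta(\cdot\mid x)}\big[-\nabla_\theta^{2} \log p_\theta(y\mid x)\big],

so the Fisher information matrix is a positive semi-definite stand-in for the Hessian of the negative log-likelihood, which is what makes natural gradient descent behave like an approximate second-order method.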