# Learning Rates as a Function of Batch Size: A Random Matrix Theory Approach to Neural Network Training.

@article{Granziol2020LearningRA, title={Learning Rates as a Function of Batch Size: A Random Matrix Theory Approach to Neural Network Training.}, author={Diego Granziol and Stefan Zohren and Stephen J. Roberts}, journal={arXiv: Machine Learning}, year={2020} }

We study the effect of mini-batching on the loss landscape of deep neural networks using spiked, field-dependent random matrix theory. We demonstrate that the magnitudes of the extremal values of the batch Hessian are larger than those of the empirical Hessian. We also derive similar results for the Generalised Gauss-Newton matrix approximation of the Hessian. As a consequence of our theorems we derive analytical expressions for the maximal learning rates as a function of batch size…
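The abstract's claim that batch Hessians have larger extremal eigenvalues, and that this limits the learning rate, can be illustrated on a toy problem. The sketch below is not the paper's method or formula: it assumes a least-squares loss (whose Hessian does not depend on the weights), uses the standard stability condition for gradient descent on a quadratic, `lr < 2 / lambda_max`, and the function names `top_eigenvalue` and `batch_vs_full_sharpness` are hypothetical.

```python
import numpy as np

def top_eigenvalue(h):
    """Largest eigenvalue of a symmetric matrix."""
    return float(np.linalg.eigvalsh(h)[-1])

def batch_vs_full_sharpness(n_samples=1024, dim=32, batch_size=16, seed=0):
    # Toy least-squares model (an assumption, not the paper's setup):
    # loss = mean over samples of 0.5 * (x @ w - y)**2, so the Hessian
    # of any set of samples is X.T @ X / len(X), independent of w.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_samples, dim))
    full = top_eigenvalue(x.T @ x / n_samples)
    # Split the data into disjoint, equal-sized batches
    # (n_samples must be divisible by batch_size).
    batches = x.reshape(n_samples // batch_size, batch_size, dim)
    per_batch = [top_eigenvalue(b.T @ b / batch_size) for b in batches]
    return full, float(np.mean(per_batch))

full, batch = batch_vs_full_sharpness()
# On a quadratic, plain gradient descent is stable only for lr < 2 / lambda_max.
print(f"full Hessian:  lambda_max={full:.3f}, max lr={2 / full:.3f}")
print(f"batch Hessian: lambda_max={batch:.3f}, max lr={2 / batch:.3f}")
```

Because the full Hessian here is the average of the batch Hessians and the largest eigenvalue is a convex function of a symmetric matrix, the mean batch value can never fall below the full-data value in this toy setting. Increasing `batch_size` moves the two closer together, which is the qualitative trend the abstract describes.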

## 8 Citations

### On the SDEs and Scaling Rules for Adaptive Gradient Algorithms

- Computer Science
- ArXiv
- 2022

This paper derives the SDE approximations for RMSprop and Adam, giving theoretical guarantees of their correctness as well as experimental validation of their applicability to common large-scale vision and language settings.

### A Random Matrix Theory Approach to the Adaptive Gradient Generalisation Gap

- Computer Science
- 2021

Major generalisation improvements are demonstrated by increasing the shrinkage coefficient, closing the generalisation gap entirely in both logistic regression and deep neural network experiments. Other popular modifications to adaptive methods are also shown to calibrate parameter updates to make better use of sharper, more reliable directions.

### Few-shot Time-Series Forecasting with Application for Vehicular Traffic Flow

- Computer Science
- 2022 IEEE 23rd International Conference on Information Reuse and Integration for Data Science (IRI)
- 2022

Deep neural network architectures for few-shot forecasting are proposed that use a Siamese twin network approach to learn a difference function between pairs of time series, rather than forecasting directly from historical data as traditional forecasting models do.

### Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training

- Computer Science
- ArXiv
- 2022

Zeus, an optimization framework that navigates the tradeoff between energy consumption and performance optimization by automatically selecting optimal job- and GPU-level configurations for recurring DNN training jobs, is proposed.

### Universal characteristics of deep neural network loss surfaces from random matrix theory

- Computer Science
- ArXiv
- 2022

This paper considers several aspects of random matrix universality in deep neural networks. Motivated by recent experimental work, we use universal properties of random matrices related to local…

### A random matrix theory approach to damping in deep learning

- Computer Science
- 2022

A novel random matrix theory based damping learner for second order optimisers inspired by linear shrinkage estimation is developed, and it is demonstrated that the derived method works well with adaptive gradient methods such as Adam.

### Deformed semicircle law and concentration of nonlinear random matrices for ultra-wide neural networks

- Mathematics, Computer Science
- ArXiv
- 2021

A nonlinear Hanson-Wright inequality suitable for neural networks with random weights and Lipschitz activation functions is provided, and it is verified that random feature regression achieves the same asymptotic performance as its limiting kernel regression in the ultra-wide limit.

### Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models

- Computer Science
- NeurIPS
- 2021

The Volterra equation is derived, a heuristic derivation of the homogenized SGD approximation to the SDA class of algorithms on the least squares problem is given, and SGD and homogenized SGD are shown to be close under orthogonal invariance.

## References

### Traces of Class/Cross-Class Structure Pervade Deep Learning Spectra

- Computer Science
- ArXiv
- 2020

The ratio of outliers to bulk in the spectrum of the Fisher information matrix is predictive of misclassification in the context of multinomial logistic regression, and a correction to KFAC, a well-known second-order optimization algorithm for training deepnets, is proposed.

### Automatic differentiation in PyTorch

- Computer Science
- 2017

An automatic differentiation module of PyTorch is described. PyTorch is a library designed to enable rapid research on machine learning models; the module focuses on differentiation of purely imperative programs, with an emphasis on extensibility and low overhead.

### Train longer, generalize better: closing the generalization gap in large batch training of neural networks

- Computer Science
- NIPS
- 2017

This work proposes a "random walk on random landscape" statistical model which is known to exhibit similar "ultra-slow" diffusion behavior and presents a novel algorithm named "Ghost Batch Normalization" which enables a significant decrease in the generalization gap without increasing the number of updates.

### Deep Residual Learning for Image Recognition

- Computer Science
- 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.

### Optimizing Neural Networks with Kronecker-factored Approximate Curvature

- Computer Science
- ICML
- 2015

K-FAC is an efficient method for approximating natural gradient descent in neural networks which is based on an efficiently invertible approximation of a neural network's Fisher information matrix which is neither diagonal nor low-rank, and in some cases is completely non-sparse.

### Very Deep Convolutional Networks for Large-Scale Image Recognition

- Computer Science
- ICLR
- 2015

This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

### Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

- Computer Science
- ICML
- 2015

Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.

### An Investigation into Neural Net Optimization via Hessian Eigenvalue Density

- Computer Science
- ICML
- 2019

To understand the dynamics of optimization in deep neural networks, we develop a tool to study the evolution of the entire Hessian spectrum throughout the optimization process. Using this, we study a…

### The Lanczos and conjugate gradient algorithms in finite precision arithmetic

- Computer Science, Mathematics
- Acta Numerica
- 2006

This paper pays tribute to those who have made an understanding of the Lanczos and conjugate gradient algorithms possible through their pioneering work, and reviews recent solutions of several open problems that have also contributed to knowledge of the subject.

### The Full Spectrum of Deep Net Hessians At Scale: Dynamics with Sample Size

- Computer Science
- ArXiv
- 2018

The results corroborate previous findings, based on small-scale networks, that the Hessian exhibits 'spiked' behavior, with several outliers isolated from a continuous bulk. However, they indicate that the bulk does not follow a simple Marchenko-Pastur distribution, as previously suggested, but rather a heavier-tailed distribution.