Fractional moment-preserving initialization schemes for training fully-connected neural networks
@article{Grbzbalaban2020FractionalMI,
  title   = {Fractional moment-preserving initialization schemes for training fully-connected neural networks},
  author  = {Mert G{\"u}rb{\"u}zbalaban and Yuanhan Hu},
  journal = {ArXiv},
  year    = {2020},
  volume  = {abs/2005.11878}
}
A traditional approach to initialization in deep neural networks (DNNs) is to sample the network weights randomly so as to preserve the variance of the pre-activations. On the other hand, several studies show that during the training process the distribution of stochastic gradients can be heavy-tailed, especially for small batch sizes. In this case, weights and therefore pre-activations can be modeled with a heavy-tailed distribution that has infinite variance but a finite (non-integer…
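As a rough illustration of the idea in the abstract, the sketch below initializes fully-connected layers with symmetric alpha-stable weights (infinite variance for alpha < 2) scaled by fan_in**(-1/alpha), the heavy-tailed analogue of the usual fan_in**(-1/2) variance-preserving rule, and tracks a fractional moment of order p < alpha of the pre-activations. The tail index, the moment order, the 2**(1/alpha) ReLU correction, and the layer sizes are all illustrative assumptions; the paper derives its own scaling constants.

```python
# Minimal sketch (assumptions noted above): heavy-tailed initialization that
# aims to keep the fractional moment E|h|^p of the pre-activations roughly
# constant across fully-connected ReLU layers. Weights are symmetric
# alpha-stable with scale ~ fan_in**(-1/alpha); the 2**(1/alpha) factor is a
# heuristic ReLU correction (analogous to the factor 2 in He initialization),
# not the exact constant derived in the paper.
import numpy as np
from scipy.stats import levy_stable

alpha, p = 1.8, 1.0            # tail index in (1, 2), moment order p < alpha (assumed)
width, depth = 512, 5          # arbitrary layer sizes

x = np.random.randn(width)
for layer in range(depth):
    scale = (2.0 ** (1.0 / alpha)) * width ** (-1.0 / alpha)
    W = scale * levy_stable.rvs(alpha, beta=0.0, size=(width, width))
    h = W @ x                                  # pre-activations: infinite variance,
    print(layer, np.mean(np.abs(h) ** p))      # but the p-th moment stays of the same order
    x = np.maximum(h, 0.0)                     # ReLU
```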
2 Citations
Asymmetric Heavy Tails and Implicit Bias in Gaussian Noise Injections
- Computer Science, ICML
- 2021
This paper develops a Langevin-like stochastic differential equation driven by a general family of asymmetric heavy-tailed noise and formally proves that Gaussian noise injections (GNIs) induce an ‘implicit bias’ that varies with the heaviness of the tails and the level of asymmetry.
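A minimal sketch of the kind of object this summary refers to: an Euler-type discretization of a Langevin-like recursion whose increments are asymmetric alpha-stable. The quadratic toy loss, step size, tail index, and skewness below are assumptions for illustration, not the paper's construction.

```python
# Sketch: Euler-type discretization of a Langevin-like recursion driven by
# asymmetric alpha-stable noise. Loss, step size, and noise parameters are
# illustrative assumptions only.
import numpy as np
from scipy.stats import levy_stable

alpha, beta = 1.7, 0.5                  # tail index and skewness (assumed)
eta, n_steps = 1e-3, 5_000              # step size and horizon (assumed)

def grad(theta):                        # gradient of the toy loss 0.5 * ||theta||^2
    return theta

theta = np.ones(10)
for _ in range(n_steps):
    xi = levy_stable.rvs(alpha, beta, size=theta.shape)
    # heavy-tailed increments enter with the alpha-dependent scaling eta**(1/alpha)
    theta = theta - eta * grad(theta) + eta ** (1.0 / alpha) * xi
print(np.linalg.norm(theta))
```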
Convergence Rates of Stochastic Gradient Descent under Infinite Noise Variance
- Mathematics, Computer Science, NeurIPS
- 2021
The results indicate that even under heavy-tailed noise with infinite variance, SGD can converge to the global optimum without requiring any modification to either the loss function or the algorithm itself, as is typically required in robust statistics.
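The sketch below illustrates the setting rather than the paper's analysis: unmodified SGD on a strongly convex quadratic whose gradient noise is symmetric alpha-stable with alpha < 2, hence infinite variance. The problem, step-size schedule, and noise scale are assumptions for illustration.

```python
# Sketch: plain SGD on 0.5 * ||theta||^2 with alpha-stable gradient noise
# (alpha < 2, so the noise variance is infinite). All constants are assumptions.
import numpy as np
from scipy.stats import levy_stable

alpha = 1.6
theta = np.full(5, 10.0)                       # start far from the optimum at 0
for k in range(1, 10_001):
    noise = levy_stable.rvs(alpha, beta=0.0, size=theta.shape)
    grad = theta + 0.1 * noise                 # noisy gradient
    theta -= (1.0 / k) * grad                  # decaying step size, no clipping or robustification
print(np.linalg.norm(theta))                   # typically close to 0 despite the heavy-tailed noise
```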
References
Showing 1-10 of 62 references
Variance-Preserving Initialization Schemes Improve Deep Network Training: But Which Variance is Preserved?
- Computer Science, ArXiv
- 2019
Experimental results from training neural networks support the idea that preserving sample statistics can be better than preserving total variance; the paper also discusses the implications for the alternative rule of thumb that a network should be initialized at the "edge of chaos".
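For contrast with the fractional-moment schemes above, here is the standard Gaussian fan-in rule these works start from (He-style initialization for ReLU layers); the widths are arbitrary.

```python
# Sketch: the classic variance-preserving fan-in rule for ReLU layers
# (std = sqrt(2 / fan_in)), shown only as the Gaussian baseline. Widths are arbitrary.
import numpy as np

rng = np.random.default_rng(0)

def he_init(fan_in, fan_out):
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))

x = rng.normal(size=256)
for fan_out in (256, 256, 256):
    W = he_init(x.shape[0], fan_out)
    x = np.maximum(W @ x, 0.0)        # ReLU
    print((x ** 2).mean())            # mean squared activation stays roughly constant
```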
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
- Computer Science, ICML
- 2015
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
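A minimal sketch of the batch normalization transform itself: each feature is normalized by its batch statistics, then rescaled and shifted by learnable parameters. The epsilon value and toy shapes are arbitrary, and the running statistics and backward pass of a full implementation are omitted.

```python
# Sketch: batch normalization forward pass at training time, per feature:
# x_hat = (x - batch_mean) / sqrt(batch_var + eps), y = gamma * x_hat + beta.
# Running statistics for inference and the backward pass are omitted.
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)               # per-feature mean over the batch
    var = x.var(axis=0)                 # per-feature variance over the batch
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 8) * 3.0 + 1.0  # batch of 32, 8 features, shifted and scaled
y = batch_norm_forward(x, gamma=np.ones(8), beta=np.zeros(8))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # ~0 and ~1 per feature
```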
A Mean Field Theory of Quantized Deep Networks: The Quantization-Depth Trade-Off
- Computer Science, NeurIPS
- 2019
This work applies mean-field techniques to networks with quantized activations in order to evaluate the degree to which quantization degrades signal propagation at initialization, and derives initialization schemes which maximize signal propagation in such networks.
Sample Variance Decay in Randomly Initialized ReLU Networks
- Computer Science
- 2019
Through an analytical calculation in the limit of infinite network width and numerical simulations at finite width, it is demonstrated that even when the total variance is preserved, the sample variance decays in the later layers.
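A small finite-width simulation consistent with this description (width, depth, and batch size are arbitrary choices): with the He rule the mean squared activation is roughly preserved with depth, while the per-neuron variance across a batch of inputs shrinks.

```python
# Sketch: a batch of inputs pushed through one randomly He-initialized ReLU
# network. The mean squared activation (a total-variance proxy) stays roughly
# constant with depth, while the variance across the batch decays.
import numpy as np

rng = np.random.default_rng(0)
width, depth, batch = 256, 60, 64
X = rng.normal(size=(width, batch))            # columns are input samples
for layer in range(1, depth + 1):
    W = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))
    X = np.maximum(W @ X, 0.0)
    if layer % 15 == 0:
        total = (X ** 2).mean()                # roughly preserved by the He rule
        across_samples = X.var(axis=1).mean()  # sample variance: decays with depth
        print(layer, round(total, 3), round(across_samples, 3))
```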
Spectrum Concentration in Deep Residual Learning: A Free Probability Approach
- Computer Science, IEEE Access
- 2019
It is empirically demonstrated that the proposed initialization scheme learns orders of magnitude faster than classical ones, attesting to the strong practical relevance of this investigation.
Traditional and Heavy-Tailed Self Regularization in Neural Network Models
- Computer Science, ICML
- 2019
A novel form of Heavy-Tailed Self-Regularization, similar to the self-organization seen in the statistical physics of disordered systems, is identified; it can depend strongly on the many knobs of the training process.
On the Selection of Initialization and Activation Function for Deep Neural Networks
- Computer Science, ArXiv
- 2018
This analysis identifies a class of activation functions, including the Swish activation, that improve information propagation over ReLU-like functions, providing a theoretical grounding for the excellent empirical performance of $\phi_{swish}$ observed in these contributions.
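For reference, the Swish activation mentioned here is phi_swish(x) = x * sigmoid(x), optionally with a beta parameter; the quick comparison against ReLU below is purely illustrative.

```python
# Sketch: the Swish activation, x * sigmoid(beta * x), next to ReLU.
import numpy as np

def swish(x, beta=1.0):
    return x / (1.0 + np.exp(-beta * x))   # x * sigmoid(beta * x)

def relu(x):
    return np.maximum(x, 0.0)

x = np.linspace(-3.0, 3.0, 7)
print(np.round(swish(x), 3))   # smooth, slightly non-monotone for negative inputs
print(relu(x))
```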
Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10, 000-Layer Vanilla Convolutional Neural Networks
- Computer Science, ICML
- 2018
This work demonstrates that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme, and presents an algorithm for generating the random initial orthogonal convolution kernels that this scheme requires.
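One construction consistent with this description is a "delta-orthogonal" kernel: the convolution kernel is zero everywhere except at its spatial centre, where an orthogonal matrix maps input to output channels. The shapes below are arbitrary and the paper's exact algorithm may differ in details; this is only a hedged sketch.

```python
# Sketch: a delta-orthogonal-style convolution kernel (assumed construction,
# see the note above). Zero everywhere except the spatial centre, which holds
# a matrix with orthonormal rows mapping c_in channels to c_out channels.
import numpy as np

def delta_orthogonal(k, c_in, c_out, rng=np.random.default_rng()):
    assert c_out >= c_in, "this sketch assumes c_out >= c_in"
    q, _ = np.linalg.qr(rng.normal(size=(c_out, c_in)))    # c_out x c_in, orthonormal columns
    kernel = np.zeros((k, k, c_in, c_out))
    kernel[k // 2, k // 2] = q.T                            # place Q^T at the spatial centre
    return kernel

K = delta_orthogonal(3, 8, 16)
print(K.shape, np.allclose(K[1, 1] @ K[1, 1].T, np.eye(8)))  # centre slice has orthonormal rows
```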
On the Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks
- Computer Science, ArXiv
- 2019
It is argued that the Gaussianity assumption might fail to hold in deep learning settings, rendering Brownian motion-based analyses inappropriate, and an explicit connection is established between the convergence rate of SGD to a local minimum and the tail-index $\alpha$.
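To illustrate the kind of quantity the tail-index $\alpha$ measures, here is a standard Hill-type estimator applied to heavy-tailed versus Gaussian samples; this is shown only for illustration and is not the estimation procedure used in the paper.

```python
# Sketch: a standard Hill-type estimator of the tail index, for illustration only.
import numpy as np

def hill_estimator(samples, k):
    """Estimate the tail index from the k largest absolute order statistics."""
    x = np.sort(np.abs(samples))[::-1]          # descending absolute values
    logs = np.log(x[:k]) - np.log(x[k])
    return 1.0 / logs.mean()

rng = np.random.default_rng(0)
heavy = rng.standard_cauchy(100_000)            # Cauchy: true tail index alpha = 1
light = rng.standard_normal(100_000)            # Gaussian: lighter than any power law
print(hill_estimator(heavy, 1000))              # close to 1
print(hill_estimator(light, 1000))              # noticeably larger, reflecting light tails
```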
MetaInit: Initializing learning by learning to initialize
- Computer Science, NeurIPS
- 2019
This work introduces MetaInit, an algorithm based on the hypothesis that good initializations make gradient descent easier by starting in regions that look locally linear with minimal second-order effects; it minimizes a measure of these second-order effects efficiently by using gradient descent to tune the norms of the initial weight matrices.