• Corpus ID: 218869556

# Fractional moment-preserving initialization schemes for training fully-connected neural networks

@article{Grbzbalaban2020FractionalMI,
  title={Fractional moment-preserving initialization schemes for training fully-connected neural networks},
  author={Mert G{\"u}rb{\"u}zbalaban and Yuanhan Hu},
  journal={ArXiv},
  year={2020},
  volume={abs/2005.11878}
}
• Published 25 May 2020 • Computer Science • ArXiv
A traditional approach to initialization in deep neural networks (DNNs) is to sample the network weights randomly so as to preserve the variance of the pre-activations. On the other hand, several studies show that during the training process the distribution of stochastic gradients can be heavy-tailed, especially for small batch sizes. In this case, weights, and therefore pre-activations, can be modeled with a heavy-tailed distribution that has an infinite variance but has a finite (non-integer…
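The contrast between the two regimes can be sketched numerically. The snippet below is a minimal illustration, not the paper's exact scheme: the width-dependent scaling of the stable weights and the choice of tail index are assumptions, and symmetric alpha-stable samples are drawn with the standard Chambers-Mallows-Stuck construction.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 256, 256

# Classical variance-preserving (He-style) Gaussian initialization:
# choosing Var(w) = 2 / n_in keeps the pre-activation variance
# roughly constant across layers of a ReLU network.
W_gauss = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

# Heavy-tailed alternative: symmetric alpha-stable weights with
# alpha < 2 have infinite variance, yet every fractional moment
# E|w|^p with p < alpha is finite.
def sample_sas(alpha, size, rng):
    # Chambers-Mallows-Stuck sampler for the symmetric case (beta = 0).
    U = rng.uniform(-np.pi / 2, np.pi / 2, size)   # uniform phase
    W = rng.exponential(1.0, size)                 # exponential mixing variable
    return (np.sin(alpha * U) / np.cos(U) ** (1.0 / alpha)
            * (np.cos((1.0 - alpha) * U) / W) ** ((1.0 - alpha) / alpha))

alpha = 1.8
scale = n_in ** (-1.0 / alpha)   # hypothetical width-dependent scaling
W_stable = scale * sample_sas(alpha, (n_out, n_in), rng)

x = rng.normal(size=n_in)
pre_gauss = W_gauss @ x
pre_stable = W_stable @ x

# Compare a fractional moment (p < alpha) of the pre-activations;
# for the stable weights the variance itself would be infinite.
p = 1.0
print(np.mean(np.abs(pre_gauss) ** p), np.mean(np.abs(pre_stable) ** p))
```

Both fractional moments are finite even though the stable weights have no second moment, which is the property a fractional moment-preserving scheme exploits.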
## 2 Citations

This paper develops a Langevin-like stochastic differential equation that is driven by a general family of asymmetric heavy-tailed noise and formally proves that Gaussian noise injections (GNIs) induce an 'implicit bias' which varies depending on the heaviness of the tails and the level of asymmetry.
• Mathematics, Computer Science • NeurIPS • 2021
The results indicate that even under heavy-tailed noise with infinite variance, SGD can converge to the global optimum without any modification to either the loss function or the algorithm itself, as is typically required in robust statistics.

## References

Showing 1-10 of 62 references

• Computer Science • ArXiv • 2019
The experimental results from training neural nets support the idea that preserving sample statistics can be better than preserving total variance, and discuss the implications for the alternative rule of thumb that a network should be initialized to be at the "edge of chaos".
• Computer Science • ICML • 2015
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
• Computer Science • NeurIPS • 2019
This work applies mean-field techniques to networks with quantized activations in order to evaluate the degree to which quantization degrades signal propagation at initialization, and derives initialization schemes which maximize signal propagation in such networks.
• Computer Science • 2019
It is demonstrated that even when the total variance is preserved, the sample variance decays in the later layers through an analytical calculation in the limit of infinite network width, and numerical simulations for finite width.
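The decay of the sample variance under a variance-preserving rule can be probed with a small simulation. The depth, width, and He-style scaling below are illustrative assumptions, not values from the cited paper:

```python
import numpy as np

rng = np.random.default_rng(1)
width, depth = 512, 50

# Forward pass through a deep ReLU network with He initialization,
# recording the empirical (sample) variance of the pre-activations
# over the units of each layer.
x = rng.normal(size=width)
sample_vars = []
for _ in range(depth):
    W = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))
    z = W @ x                    # pre-activations
    sample_vars.append(z.var())  # sample variance across units
    x = np.maximum(z, 0.0)       # ReLU

# The *expected* variance is constant by construction, but a single
# realization's sample variance drifts with depth, which is the
# quantity the cited analysis tracks.
print(sample_vars[0], sample_vars[-1])
```

Repeating the run over many seeds and widths would show how the drift shrinks as the width grows, matching the infinite-width calculation described above.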
• Computer Science • IEEE Access • 2019
It is empirically demonstrated that the proposed initialization scheme learns orders of magnitude faster than the classical ones, attesting to the strong practical relevance of this investigation.
• Computer Science • ICML • 2019
A novel form of Heavy-Tailed Self-Regularization is identified, similar to the self-organization seen in the statistical physics of disordered systems, which can depend strongly on the many knobs of the training process.
• Computer Science • ArXiv • 2018
This analysis identifies a class of activation functions, including the Swish activation, that improve information propagation over ReLU-like functions, providing a theoretical grounding for the excellent empirical performance of $\phi_{swish}$ observed in these contributions.
• Computer Science • ICML • 2018
This work demonstrates that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme, and presents an algorithm for generating the required random initial orthogonal convolution kernels.
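One common form of such an orthogonal convolution kernel is a "delta-orthogonal" initialization: an orthogonal matrix at the spatial center of the kernel and zeros elsewhere, so the layer acts as an orthogonal map at initialization. The shapes and the helper name below are illustrative assumptions, not the cited algorithm verbatim:

```python
import numpy as np

rng = np.random.default_rng(2)

def delta_orthogonal(c_out, c_in, k):
    # Illustrative sketch (assumes c_out >= c_in): place an orthogonal
    # matrix at the kernel's spatial center and zeros everywhere else.
    A = rng.normal(size=(c_out, c_out))
    Q, _ = np.linalg.qr(A)                   # orthogonal matrix from QR
    kernel = np.zeros((c_out, c_in, k, k))
    kernel[:, :, k // 2, k // 2] = Q[:, :c_in]
    return kernel

K = delta_orthogonal(64, 64, 3)
center = K[:, :, 1, 1]
print(np.allclose(center.T @ center, np.eye(64)))   # prints True
```

Because only the center tap is nonzero, applying the kernel with "same" padding simply multiplies each spatial position's channel vector by the orthogonal matrix, preserving signal norms through depth.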
• Computer Science • ArXiv • 2019
It is argued that the Gaussianity assumption might fail to hold in deep learning settings, rendering Brownian motion-based analyses inappropriate, and an explicit connection is established between the convergence rate of SGD to a local minimum and the tail-index $\alpha$.
• Computer Science • NeurIPS • 2019
This work introduces an algorithm called MetaInit, based on the hypothesis that good initializations make gradient descent easier by starting in regions that look locally linear, with minimal second-order effects; the method minimizes a measure of these effects efficiently by using gradient descent to tune the norms of the initial weight matrices.