Corpus ID: 218869556

Fractional moment-preserving initialization schemes for training fully-connected neural networks

Mert Gürbüzbalaban and Yuanhan Hu
A traditional approach to initialization in deep neural networks (DNNs) is to sample the network weights randomly so as to preserve the variance of the pre-activations. On the other hand, several studies show that during training the distribution of stochastic gradients can be heavy-tailed, especially for small batch sizes. In this case, weights and therefore pre-activations can be modeled with a heavy-tailed distribution that has an infinite variance but a finite (non-integer… 
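As an illustrative sketch of the traditional variance-preserving approach the abstract contrasts against (this uses the standard "He" scheme for ReLU networks, not the paper's fractional moment-preserving scheme): drawing Gaussian weights with variance 2/fan_in keeps the second moment of the pre-activations roughly constant from layer to layer.

```python
import numpy as np

# Classical variance-preserving ("He") initialization for a ReLU network:
# Gaussian weights with Var = 2/fan_in keep pre-activation scale stable.
rng = np.random.default_rng(0)
width, depth = 512, 10

h = rng.normal(size=(1024, width))          # input batch, unit variance
moments = []                                # second moment of pre-activations
for _ in range(depth):
    W = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))
    z = h @ W                               # pre-activations
    moments.append(float(np.mean(z ** 2)))
    h = np.maximum(z, 0.0)                  # ReLU

# The pre-activation scale neither explodes nor vanishes with depth.
ratio = moments[-1] / moments[0]
```

Under heavy-tailed weights with infinite variance, this second-moment bookkeeping breaks down, which is what motivates tracking a finite fractional moment instead.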

Asymmetric Heavy Tails and Implicit Bias in Gaussian Noise Injections

This paper develops a Langevin-like stochastic differential equation that is driven by a general family of asymmetric heavy-tailed noise and formally proves that GNIs induce an ‘implicit bias’, which varies depending on the heaviness of the tails and the level of asymmetry.

Convergence Rates of Stochastic Gradient Descent under Infinite Noise Variance

The results indicate that even under heavy-tailed noise with infinite variance, SGD can converge to the global optimum without any modification to either the loss function or the algorithm itself, as is typically required in robust statistics.

Variance-Preserving Initialization Schemes Improve Deep Network Training: But Which Variance is Preserved?

Experimental results from training neural nets support the idea that preserving sample statistics can be better than preserving total variance; the implications for the alternative rule of thumb that a network should be initialized at the "edge of chaos" are discussed.

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
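A minimal sketch of the batch-normalization transform summarized above: each feature is standardized with its batch mean and variance, then rescaled and shifted. The `gamma`/`beta` names follow the usual convention and are shown as fixed values for illustration rather than learned parameters.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Standardize each feature over the batch, then scale and shift."""
    mean = x.mean(axis=0)                   # per-feature batch mean
    var = x.var(axis=0)                     # per-feature batch variance
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(1)
x = rng.normal(3.0, 5.0, size=(256, 8))     # shifted, scaled activations
y = batch_norm(x, gamma=np.ones(8), beta=np.zeros(8))
```

With identity scale and zero shift, the output has approximately zero mean and unit variance per feature, which is the reduction of internal covariate shift the paper describes.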

A Mean Field Theory of Quantized Deep Networks: The Quantization-Depth Trade-Off

This work applies mean-field techniques to networks with quantized activations in order to evaluate the degree to which quantization degrades signal propagation at initialization, and derives initialization schemes which maximize signal propagation in such networks.

Sample Variance Decay in Randomly Initialized ReLU Networks

It is demonstrated that even when the total variance is preserved, the sample variance decays in the later layers through an analytical calculation in the limit of infinite network width, and numerical simulations for finite width.
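The effect summarized above can be reproduced in a small numerical sketch (an illustration under He initialization, not the paper's exact setup): the total second moment of the activations stays roughly constant, yet the sample variance, taken over the batch for each unit, decays with depth.

```python
import numpy as np

# Deep ReLU net with variance-preserving He initialization.
rng = np.random.default_rng(2)
width, depth, batch = 512, 50, 256

h = rng.normal(size=(batch, width))
sample_var, total_moment = [], []
for _ in range(depth):
    W = rng.normal(0.0, np.sqrt(2.0 / width), size=(width, width))
    h = np.maximum(h @ W, 0.0)
    sample_var.append(float(h.var(axis=0).mean()))   # variance over the batch
    total_moment.append(float(np.mean(h ** 2)))      # total scale, preserved
```

The decay arises because activations for different inputs become increasingly correlated with depth, so the per-unit batch mean absorbs a growing share of the preserved total scale.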

Spectrum Concentration in Deep Residual Learning: A Free Probability Approach

It is empirically demonstrated that the proposed initialization scheme learns orders of magnitude faster than classical ones, attesting to the strong practical relevance of this investigation.

Traditional and Heavy-Tailed Self Regularization in Neural Network Models

A novel form of Heavy-Tailed Self-Regularization is identified, similar to the self-organization seen in the statistical physics of disordered systems, which can depend strongly on the many knobs of the training process.

On the Selection of Initialization and Activation Function for Deep Neural Networks

This analysis identifies a class of activation functions, including Swish, that improve information propagation over ReLU-like functions, providing a theoretical grounding for the excellent empirical performance of $\phi_{swish}$ observed in these contributions.

Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks

This work demonstrates that it is possible to train vanilla CNNs with ten thousand layers or more simply by using an appropriate initialization scheme, and presents an algorithm for generating such random initial orthogonal convolution kernels.
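The dense-layer analogue of the idea above can be sketched as follows: a random orthogonal weight matrix acts as an isometry, preserving the norm of every input signal exactly. The paper extends this to convolution kernels; the QR-based sampler here is a standard construction assumed for illustration, not the paper's algorithm.

```python
import numpy as np

# Sample a random orthogonal matrix via QR decomposition of a Gaussian
# matrix; the sign correction makes the draw uniform (Haar-distributed).
rng = np.random.default_rng(3)
n = 64
A = rng.normal(size=(n, n))
Q, R = np.linalg.qr(A)
Q = Q * np.sign(np.diag(R))      # fix column signs for a uniform draw

# Orthogonality means Q preserves norms: an isometric layer at init.
x = rng.normal(size=n)
norm_ratio = np.linalg.norm(Q @ x) / np.linalg.norm(x)
```

Exact norm preservation at every layer is what keeps signals from exploding or vanishing even at extreme depth, which is the dynamical-isometry property the paper relies on.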

On the Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks

It is argued that the Gaussianity assumption might fail to hold in deep learning settings, rendering Brownian motion-based analyses inappropriate, and an explicit connection is established between the convergence rate of SGD to a local minimum and the tail-index $\alpha$.

MetaInit: Initializing learning by learning to initialize

This work introduces an algorithm called MetaInit, based on the hypothesis that good initializations make gradient descent easier by starting in regions that look locally linear, with minimal second-order effects; MetaInit minimizes this quantity efficiently by using gradient descent to tune the norms of the initial weight matrices.