• Corpus ID: 85529425

A Sober Look at Neural Network Initializations

Ingo Steinwart
Initializing the weights and the biases is a key part of the training process of a neural network. Unlike the subsequent optimization phase, however, the initialization phase has gained only limited attention in the literature. In this paper we discuss some consequences of commonly used initialization strategies for vanilla DNNs with ReLU activations. Based on these insights we then develop an alternative initialization strategy. Finally, we present some large scale experiments assessing the… 
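For orientation, one widely used initialization for vanilla ReLU networks is He initialization: weights drawn from a zero-mean Gaussian with variance 2 / fan_in, and biases set to zero. The truncated abstract does not say which strategies the paper examines, so the sketch below is only an illustration of a common scheme, not the paper's method; the names `he_init` and `forward`, the layer sizes, and the sample input are all assumptions.

```python
import math
import random


def he_init(layer_sizes, seed=0):
    """Return one (weights, biases) pair per layer.

    Weights are drawn from N(0, 2 / fan_in); biases start at zero.
    This is an assumed, commonly used scheme, chosen for illustration.
    """
    rng = random.Random(seed)
    params = []
    for fan_in, fan_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        std = math.sqrt(2.0 / fan_in)
        weights = [[rng.gauss(0.0, std) for _ in range(fan_in)]
                   for _ in range(fan_out)]
        biases = [0.0] * fan_out
        params.append((weights, biases))
    return params


def forward(params, x):
    """Propagate x through the network, applying ReLU on hidden layers only."""
    for i, (weights, biases) in enumerate(params):
        x = [sum(w * v for w, v in zip(row, x)) + b
             for row, b in zip(weights, biases)]
        if i < len(params) - 1:
            x = [max(0.0, v) for v in x]
    return x


# Hypothetical architecture: 4 inputs, two hidden layers of 16 units, 1 output.
params = he_init([4, 16, 16, 1])
output = forward(params, [0.5, -1.0, 2.0, 0.1])
```

With zero biases, every hidden unit's kink passes through the origin at initialization, which is one reason the choice of bias initialization can matter for ReLU networks.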
Training Two-Layer ReLU Networks with Gradient Descent is Inconsistent
For a large class of one-dimensional data-generating distributions, gradient descent finds, with high probability, only a bad local minimum of the optimization landscape; in these cases, the found network essentially performs linear regression even if the target function is non-linear.
Persistent Neurons
It is shown that persistent neurons, under certain data distributions, are able to converge to better solutions while initializations under popular frameworks find bad local minima, and that they help improve the model's performance under both good and poor initializations.
On the Expected Complexity of Maxout Networks
This work shows that the practical complexity of deep ReLU networks is often far from the theoretical maximum, and shows that this phenomenon also occurs in networks with maxout (multi-argument) activation functions and when considering the decision boundaries in classification tasks.
Convergence Analysis of Neural Networks
This work defines a submanifold of all data distributions on which gradient descent fails to spread the nonlinearities across the data with high probability, i.e. it only finds a bad local minimum or valley of the optimization landscape.
Shallow Univariate ReLU Networks as Splines: Initialization, Loss Surface, Hessian, and Gradient Flow Dynamics
This work proposes taking a quotient with respect to the second symmetry group and reparametrizing ReLU NNs as continuous piecewise linear splines, developing a surprisingly simple and transparent view of the structure of the loss surface.
Lower and Upper Bounds for Numbers of Linear Regions of Graph Convolutional Networks
An optimal upper bound on the maximum number of linear regions is obtained for one-layer GCNs, together with upper and lower bounds for multi-layer GCNs, which imply that deeper GCNs have more expressivity than shallow GCNs.
Sharp bounds for the number of regions of maxout networks and vertices of Minkowski sums
Face counting formulas in terms of the intersection posets of tropical hypersurfaces or the number of upper faces of partial Minkowski sums are obtained, along with explicit sharp upper bounds for the number of regions for any input dimension, any number of units, and any ranks.


Understanding the difficulty of training deep feedforward neural networks
The objective here is to understand better why standard gradient descent from random initialization is doing so poorly with deep neural networks, to better understand these recent relative successes and help design better algorithms in the future.
Self-Normalizing Neural Networks
Self-normalizing neural networks (SNNs) are introduced to enable high-level abstract representations, and it is proved that activations close to zero mean and unit variance that are propagated through many network layers will converge towards zero mean and unit variance, even in the presence of noise and perturbations.
Deep Learning
Deep learning is making major advances in solving problems that have resisted the best attempts of the artificial intelligence community for many years, and will have many more successes in the near future because it requires very little engineering by hand and can easily take advantage of increases in the amount of available computation and data.
Non-Uniform Random Variate Generation
This chapter reviews the main methods for generating random variables, vectors and processes in non-uniform random variate generation, and provides information on the expected time complexity of various algorithms before addressing modern topics such as indirectly specified distributions, random processes, and Markov chain methods.
Support Vector Machines
This book explains the principles that make support vector machines (SVMs) a successful modelling and prediction tool for a variety of applications and provides a unique in-depth treatment of both fundamental and recent material on SVMs that so far has been scattered in the literature.
Some extensions of W. Gautschi’s inequalities for the gamma function
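It has been shown by W. Gautschi that if $0 < s < 1$, then for $x \ge 1$, $x^{1-s} < \Gamma(x+1)/\Gamma(x+s) < \exp[(1-s)\psi(x+1)]$. The following closer bounds are proved: $\exp[(1-s)\psi(x+s^{1/2})] < \Gamma(x+1)/\Gamma(x+s) < \exp[(1-s)\psi(x+\tfrac{1}{2}(s+1))]$ and $[x+\tfrac{s}{2}]^{1-s} < \Gamma(x+1)/\Gamma(x+s) <$ …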
Superconcentration and Related Topics
Preface. 1. Introduction. 2. Markov semigroups. 3. Superconcentration and chaos. 4. Multiple valleys. 5. Talagrand's method for proving superconcentration. 6. The spectral method for proving super
Uniform Bounds for the Complementary Incomplete Gamma Function
We prove upper and lower bounds for the complementary incomplete gamma function $\Gamma(a,z)$ with complex parameters $a$ and $z$. Our bounds are refined within the circular hyperboloid of one sheet
High Dimensional Probability III
I. Measures on General Spaces and Inequalities.- Stochastic inequalities and perfect independence.- Prokhorov-LeCam-Varadarajan's compactness criteria for vector measures on metric spaces.- On