Shallow Univariate ReLU Networks as Splines: Initialization, Loss Surface, Hessian, and Gradient Flow Dynamics

@article{Sahs2022ShallowUR,
  title={Shallow Univariate ReLU Networks as Splines: Initialization, Loss Surface, Hessian, and Gradient Flow Dynamics},
  author={Justin Sahs and Ryan Pyle and Aneel Damaraju and Josue Ortega Caro and Onur Tavaslioglu and Andy Lu and Ankit B. Patel},
  journal={Frontiers in Artificial Intelligence},
  year={2022},
  volume={5}
}
Understanding the learning dynamics and inductive bias of neural networks (NNs) is hindered by the opacity of the relationship between NN parameters and the function represented. In part, this is due to symmetries inherent in the NN parameterization, which allow multiple different parameter settings to yield an identical output function, obscuring the parameter-function relationship and leaving redundant degrees of freedom. The NN parameterization is invariant under two symmetries: permutation of…
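The parameterization symmetries of a shallow univariate ReLU network are easy to check numerically: permuting hidden units leaves the function unchanged, and so does rescaling each unit's input weight and bias by a positive factor while dividing its output weight by the same factor (a standard consequence of ReLU's positive homogeneity). A minimal NumPy sketch, with an illustrative network f(x) = sum_i v_i * relu(w_i x + b_i) rather than the paper's exact setup:

```python
import numpy as np

def shallow_relu(x, w, b, v):
    # f(x) = sum_i v_i * relu(w_i * x + b_i): a shallow univariate ReLU network.
    return np.maximum(np.outer(x, w) + b, 0.0) @ v

rng = np.random.default_rng(0)
n_hidden = 8
w, b, v = rng.normal(size=(3, n_hidden))
x = np.linspace(-2.0, 2.0, 101)
f = shallow_relu(x, w, b, v)

# Permutation symmetry: reordering the hidden units leaves the function unchanged.
perm = rng.permutation(n_hidden)
assert np.allclose(f, shallow_relu(x, w[perm], b[perm], v[perm]))

# Scaling symmetry: relu(a*z) = a*relu(z) for a > 0, so rescaling each unit's
# (w_i, b_i) by a_i > 0 while dividing v_i by a_i gives the same function.
a = rng.uniform(0.5, 2.0, size=n_hidden)
assert np.allclose(f, shallow_relu(x, a * w, a * b, v / a))
```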
Mean-field Analysis of Piecewise Linear Solutions for Wide ReLU Networks
TLDR
This work takes a mean-field view of a two-layer ReLU network trained via noisy SGD on a univariate regularized regression problem, and shows that SGD with vanishingly small noise injected into the gradients is biased towards a simple solution.
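As a rough illustration of the setting described above, the following NumPy sketch trains a two-layer univariate ReLU network with noisy full-batch gradient steps on an L2-regularized regression problem; the data, hyperparameters, and full-batch simplification are assumptions for illustration, not the paper's mean-field setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy univariate regression data.
x = rng.uniform(-1.0, 1.0, size=64)
y = np.sin(3 * x)

# Two-layer ReLU network f(x) = sum_i v_i * relu(w_i * x + b_i).
n_hidden = 256
w, b, v = rng.normal(scale=0.5, size=(3, n_hidden))

lr, noise_scale, weight_decay = 1e-2, 1e-4, 1e-3
for step in range(5000):
    pre = np.outer(x, w) + b            # (n_samples, n_hidden) pre-activations
    act = np.maximum(pre, 0.0)
    err = act @ v - y                   # residuals

    # Gradients of 0.5 * MSE + 0.5 * weight_decay * ||params||^2.
    mask = (pre > 0).astype(float)
    grad_v = act.T @ err / len(x) + weight_decay * v
    grad_w = (err[:, None] * mask * v * x[:, None]).sum(0) / len(x) + weight_decay * w
    grad_b = (err[:, None] * mask * v).sum(0) / len(x) + weight_decay * b

    # "Noisy SGD": a gradient step plus a small Gaussian perturbation per parameter.
    for p, g in ((w, grad_w), (b, grad_b), (v, grad_v)):
        p -= lr * g + noise_scale * rng.normal(size=p.shape)
```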
On the Effective Number of Linear Regions in Shallow Univariate ReLU Networks: Convergence Guarantees and Implicit Bias
TLDR
It is shown that when the labels are determined by the sign of a target network with r neurons, gradient flow (GF) converges in direction to a network achieving perfect training accuracy and having at most O(r) linear regions, implying a generalization bound.
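The quantity being bounded, the number of effective linear regions, can be made concrete with a short NumPy sketch: a shallow univariate ReLU network is piecewise linear with potential breakpoints at -b_i/w_i, and a breakpoint only contributes a new region if the slope actually changes there. The implementation below is an illustrative assumption, not the paper's exact definition.

```python
import numpy as np

def count_effective_regions(w, b, v, tol=1e-8):
    # f(x) = sum_i v_i * relu(w_i * x + b_i) is piecewise linear with potential
    # breakpoints (knots) at x_i = -b_i / w_i.  Crossing a knot left-to-right
    # changes the slope by v_i * |w_i|, so a knot only separates two distinct
    # linear regions if the total slope change of the units sharing it is
    # non-negligible.
    active = w != 0
    knots = -b[active] / w[active]
    jumps = v[active] * np.abs(w[active])
    uniq, inverse = np.unique(np.round(knots, 10), return_inverse=True)
    total_jump = np.zeros_like(uniq)
    np.add.at(total_jump, inverse, jumps)          # sum jumps of units sharing a knot
    return 1 + int((np.abs(total_jump) > tol).sum())

rng = np.random.default_rng(0)
r = 5
w, b, v = rng.normal(size=(3, r))
print(count_effective_regions(w, b, v))            # at most r + 1 for r hidden units
```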
Using Learning Dynamics to Explore the Role of Implicit Regularization in Adversarial Examples
TLDR
Analyzing the learning dynamics of perturbations can provide useful insights for understanding the origin of adversarial sensitivities and developing robust solutions, and also provides a simple theoretical explanation for these observations.
Local Convolutions Cause an Implicit Bias towards High Frequency Adversarial Examples
TLDR
It is found that high-frequency adversarial perturbations are critically dependent on the convolution operation, because the spatially limited nature of local convolutions induces an implicit bias towards high-frequency features.
Two-Argument Activation Functions Learn Soft XOR Operations Like Cortical Neurons
TLDR
This work emulates more biologically realistic neurons by learning canonical activation functions with two input arguments, analogous to basal and apical dendrites, in a network-in-network architecture where each neuron is modeled as a multilayer perceptron with two inputs and a single output.
Domain-driven models yield better predictions at lower cost than reservoir computers in Lorenz systems
TLDR
It is shown that, surprisingly, the least expensive D2R2 method yields the most robust results and the greatest savings compared to ESNs, and is a generalization of the well-known SINDy algorithm.

References

SHOWING 1-10 OF 60 REFERENCES
Empirical Analysis of the Hessian of Over-Parametrized Neural Networks
TLDR
A case is made that links two observations: small- and large-batch gradient descent appear to converge to different basins of attraction, but these are in fact connected through a flat region and so belong to the same basin.
Gradient Dynamics of Shallow Univariate ReLU Networks
TLDR
A theoretical and empirical study of the gradient dynamics of overparameterized shallow ReLU networks with one-dimensional input solving least-squares interpolation shows that learning in the kernel regime yields smooth, curvature-minimizing interpolants, which reduce to cubic splines for uniform initializations.
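The cubic-spline connection rests on the classical variational property that the natural cubic spline minimizes the curvature energy, the integral of f''(x)^2, among all C^2 interpolants of the data. The SciPy sketch below checks this property numerically on illustrative data; it does not reproduce the paper's kernel-regime training.

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Illustrative interpolation data for a univariate problem.
x_train = np.array([-2.0, -1.2, -0.3, 0.4, 1.1, 2.0])
y_train = np.sin(2 * x_train)

def curvature_energy(spline, lo, hi, n=20001):
    # Numerically approximate the integral of (f''(x))^2 over [lo, hi].
    x = np.linspace(lo, hi, n)
    return np.trapz(spline(x, 2) ** 2, x)

natural = CubicSpline(x_train, y_train, bc_type='natural')
not_a_knot = CubicSpline(x_train, y_train)          # default boundary condition

# The natural cubic spline minimizes curvature energy over all C^2 interpolants
# of the data, so its energy is no larger than that of any other interpolant.
print(curvature_energy(natural, x_train[0], x_train[-1]))
print(curvature_energy(not_a_knot, x_train[0], x_train[-1]))
```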
Neural Mechanics: Symmetry and Broken Conservation Laws in Deep Learning Dynamics
TLDR
By exploiting symmetry, this work analytically describes the learning dynamics of various parameter combinations at finite learning rates and batch sizes for state-of-the-art architectures trained on any dataset.
Towards understanding the true loss surface of deep neural networks using random matrix theory and iterative spectral methods
TLDR
A framework for spectral visualization based on GPU-accelerated stochastic Lanczos quadrature is proposed; it is an order of magnitude faster than state-of-the-art methods for spectral visualization and can be used generically to investigate the spectral properties of matrices in deep learning.
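Stochastic Lanczos quadrature itself is easy to sketch given only matrix-vector products. The NumPy code below is a minimal CPU version with a random symmetric matrix standing in for a network Hessian (the paper's contribution is the GPU-accelerated, Hessian-vector-product version); full reorthogonalization is omitted for brevity.

```python
import numpy as np

def lanczos_tridiag(matvec, dim, num_steps, rng):
    # Lanczos tridiagonalization of a symmetric operator given only matvecs.
    # (No reorthogonalization, which a production implementation would add.)
    v = rng.normal(size=dim)
    v /= np.linalg.norm(v)
    v_prev = np.zeros(dim)
    beta = 0.0
    alphas, betas = [], []
    for _ in range(num_steps):
        w = matvec(v) - beta * v_prev
        alpha = v @ w
        w -= alpha * v
        beta = np.linalg.norm(w)
        alphas.append(alpha)
        betas.append(beta)
        if beta < 1e-10:
            break
        v_prev, v = v, w / beta
    return np.diag(alphas) + np.diag(betas[:-1], 1) + np.diag(betas[:-1], -1)

def slq_spectral_density(matvec, dim, num_probes=10, num_steps=80, seed=0):
    # Stochastic Lanczos quadrature: each probe yields Ritz values (nodes) and
    # weights (squared first components of the tridiagonal eigenvectors) that
    # together form a discrete approximation of the operator's spectral density.
    rng = np.random.default_rng(seed)
    nodes, weights = [], []
    for _ in range(num_probes):
        T = lanczos_tridiag(matvec, dim, num_steps, rng)
        evals, evecs = np.linalg.eigh(T)
        nodes.append(evals)
        weights.append(evecs[0, :] ** 2 / num_probes)
    return np.concatenate(nodes), np.concatenate(weights)

# A random symmetric matrix stands in for a network Hessian.
dim = 500
A = np.random.default_rng(3).normal(size=(dim, dim)) / np.sqrt(dim)
H = (A + A.T) / 2
nodes, weights = slq_spectral_density(lambda z: H @ z, dim)
print(weights.sum())   # total quadrature mass is 1 up to round-off
# A weighted histogram of `nodes` (with weights=`weights`) visualizes the density.
```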
An Investigation into Neural Net Optimization via Hessian Eigenvalue Density
To understand the dynamics of optimization in deep neural networks, we develop a tool to study the evolution of the entire Hessian spectrum throughout the optimization process. Using this, we study a…
A Spline Theory of Deep Networks
TLDR
A simple penalty term is proposed that can be added to the cost function of any DN learning algorithm to force the templates to be orthogonal to each other, which leads to significantly improved classification performance and reduced overfitting with no change to the DN architecture.
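A generic soft-orthogonality penalty of the kind described can be written in a few lines; the Frobenius-norm form below is a common choice and an assumption here, not necessarily the paper's exact term.

```python
import numpy as np

def orthogonality_penalty(W):
    # Soft orthogonality penalty ||W W^T - I||_F^2: zero exactly when the rows
    # of W (the "templates") are orthonormal, positive otherwise.
    G = W @ W.T
    return float(np.sum((G - np.eye(W.shape[0])) ** 2))

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 64))
print(orthogonality_penalty(W))        # large for a random template matrix
Q, _ = np.linalg.qr(W.T)               # orthonormalize the rows of W
print(orthogonality_penalty(Q.T))      # ~0 once the templates are orthonormal
```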
A Functional Characterization of Randomly Initialized Gradient Descent in Deep ReLU Networks
TLDR
A functional view of deep neural networks provides a useful new lens for understanding them; one key result is that generalization arises from smoothness of the functional approximation combined with a flat initial approximation.
Geometry of Neural Network Loss Surfaces via Random Matrix Theory
TLDR
An analytical framework and a set of tools from random matrix theory are introduced that allow computing an approximation to the distribution of eigenvalues of the Hessian at critical points of varying energy.
How noise affects the Hessian spectrum in overparameterized neural networks
TLDR
It is shown that for overparameterized networks with a degenerate valley in their loss landscape, SGD on average decreases the trace of the Hessian of the loss.
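The Hessian trace referred to above is typically estimated without ever forming the Hessian, e.g. with Hutchinson's estimator over Hessian-vector products. A minimal NumPy sketch, with an explicit PSD matrix standing in for the Hessian as an illustrative assumption:

```python
import numpy as np

def hutchinson_trace(matvec, dim, num_probes=200, seed=0):
    # Hutchinson's estimator: for Rademacher probes z, E[z^T H z] = tr(H), so
    # averaging z^T (H z) over probes estimates the trace using only
    # matrix-vector (in practice Hessian-vector) products.
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(num_probes):
        z = rng.choice([-1.0, 1.0], size=dim)
        estimates.append(z @ matvec(z))
    return float(np.mean(estimates))

# Sanity check against an explicit PSD matrix standing in for a Hessian.
dim = 300
A = np.random.default_rng(1).normal(size=(dim, dim))
H = A.T @ A / dim
print(hutchinson_trace(lambda z: H @ z, dim), np.trace(H))
```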
A Deeper Look at the Hessian Eigenspectrum of Deep Neural Networks and its Applications to Regularization
TLDR
This work proposes a layerwise loss landscape analysis in which the loss surface at every layer is studied independently, along with how each layer's surface correlates with the overall loss surface, and shows that the layerwise Hessian geometry is largely similar to that of the entire Hessian.