Linear Stability Hypothesis and Rank Stratification for Nonlinear Models

Yaoyu Zhang, Zhongwang Zhang, Leyang Zhang, Zhiwei Bai, Tao Luo, Zhi-Qin John Xu
Models with nonlinear architectures/parameterizations such as deep neural networks (DNNs) are well known for their mysteriously good generalization performance at overparameterization. In this work, we tackle this mystery from a novel perspective focusing on the transition of the target recovery/fitting accuracy as a function of the training data size. We propose a rank stratification for general nonlinear models to uncover a model rank as an “effective size of parameters” for each function in… 



Kernel and Rich Regimes in Overparametrized Models

This work shows how the scale of the initialization controls the transition between the "kernel" and "rich" regimes and affects generalization properties in multilayer homogeneous models. It also highlights an interesting role for the width of a model when the predictor is not identically zero at initialization.

An analytic theory of generalization dynamics and transfer learning in deep linear networks

An analytic theory of the nonlinear dynamics of generalization in deep linear networks, both within and across tasks, is developed. It reveals that knowledge transfer depends sensitively, but computably, on the SNRs and input feature alignments of pairs of tasks.

Gradient Descent Quantizes ReLU Network Features

This work analyzes the behavior of gradient descent for feedforward networks with a ReLU activation function, under the assumption of small initialization and learning rate. It uncovers a quantization effect: for given input data there are only finitely many "simple" functions that can be obtained, independent of the network size.

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

It is shown that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions.

The Implicit Bias of Minima Stability: A View from Function Space

This paper extends the existing knowledge on minima stability to non-differentiable minima, which are common in ReLU nets, and shows that SGD is biased towards functions whose second derivative has a bounded weighted L1 norm, regardless of the initialization.

Understanding deep learning requires rethinking generalization

These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.

Geometry of the Loss Landscape in Overparameterized Neural Networks: Symmetries and Invariances

It is shown that adding one extra neuron to each layer is sufficient to connect all these previously discrete minima into a single manifold, providing new insights into the minimization of the non-convex loss function of overparameterized neural networks.

Implicit Regularization in Deep Matrix Factorization

This work studies the implicit regularization of gradient descent over deep linear neural networks for matrix completion and sensing, a model referred to as deep matrix factorization, and finds that adding depth to a matrix factorization enhances an implicit tendency towards low-rank solutions.

Phase diagram for two-layer ReLU neural networks at infinite-width limit

The phase diagram for the two-layer ReLU NN serves as a map for future studies and is a first step towards a more systematic investigation of the training behavior and the implicit regularization of NNs of different structures.

SGD with large step sizes learns sparse features

Observations unveil how, through the step size schedules, gradient and noise together drive the SGD dynamics through the loss landscape of neural networks.