Corpus ID: 244117259

The Three Stages of Learning Dynamics in High-Dimensional Kernel Methods

@article{Ghosh2022TheTS,
  title={The Three Stages of Learning Dynamics in High-Dimensional Kernel Methods},
  author={Nikhil Ghosh and Song Mei and Bin Yu},
  journal={ArXiv},
  year={2022},
  volume={abs/2111.07167}
}
To understand how deep learning works, it is crucial to understand the training dynamics of neural networks. Several interesting hypotheses about these dynamics have been proposed based on empirically observed phenomena, but there is limited theoretical understanding of when and why such phenomena occur. In this paper, we consider the training dynamics of gradient flow on kernel least-squares objectives, which is a limiting dynamics of SGD-trained neural networks. Using precise… 
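To make this setting concrete, here is a minimal sketch of the kind of dynamics studied: gradient flow on a kernel least-squares objective, discretized with small explicit-Euler steps, with the residual tracked along the kernel's eigendirections. The RBF kernel, the synthetic low-degree target, and all parameter choices below are illustrative assumptions, not the paper's setup.

```python
# Hedged sketch, not the authors' code: kernel choice, data model, and step size
# are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.standard_normal((n, d)) / np.sqrt(d)        # high-dimensional covariates
y = X[:, 0] + 0.5 * (X[:, 1] ** 2 - 1.0 / d)        # hypothetical low-degree target

# RBF kernel matrix (illustrative kernel and bandwidth).
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / 2.0)
evals, evecs = np.linalg.eigh(K)                    # eigenvalues in ascending order

# Gradient flow on L(a) = (1/2n) ||K a - y||^2, i.e. da/dt = -(1/n) K (K a - y),
# discretized with explicit Euler steps small enough to be stable.
a = np.zeros(n)
eta = n / evals.max() ** 2
for t in range(5001):
    residual = K @ a - y
    a -= eta * (K @ residual) / n
    if t % 1000 == 0:
        proj = evecs.T @ residual                   # residual in the kernel eigenbasis
        print(f"t={t:5d}  top-5 dirs: {np.linalg.norm(proj[-5:]):.3f}  "
              f"bottom-50 dirs: {np.linalg.norm(proj[:50]):.3f}")
```

In this parameterization the residual component along the i-th eigendirection of K decays roughly like exp(-λ_i² t / n), so directions with large kernel eigenvalues are fit almost immediately while low-eigenvalue directions move very slowly, the kind of staged behavior the paper analyzes.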


Excess Risk of Two-Layer ReLU Neural Networks in Teacher-Student Settings and its Superiority to Kernel Methods

TLDR
This work investigates the excess risk of two-layer ReLU neural networks in a teacher-student regression model, in which a student network learns an unknown teacher network through its outputs, and shows that the student network provably reaches a near-globally-optimal solution and outperforms any kernel method estimator, including the neural tangent kernel approach, random feature models, and other kernel methods, in the sense of the minimax optimal rate.

Identifying good directions to escape the NTK regime and efficiently learn low-degree plus sparse polynomials

TLDR
This paper analyzes how gradient descent on a two-layer neural network can escape the NTK regime by utilizing a spectral characterization of the NTK and building on the QuadNTK approach, and constructs a regularizer which encourages the parameter vector to move in the “good” directions, yielding an end-to-end convergence and generalization guarantee.

Generalization Properties of NAS under Activation and Skip Connection Search

TLDR
This work derives lower (and upper) bounds on the minimum eigenvalue of the Neural Tangent Kernel under the (in)finite-width regime, for a search space including mixed activation functions, fully connected networks, and residual neural networks, and leverages the eigenvalue bounds to establish generalization error bounds for NAS under stochastic gradient descent training.

References

SHOWING 1-10 OF 43 REFERENCES

Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel

TLDR
A large-scale phenomenological analysis of training reveals striking correlations among a diverse set of metrics over training time, governed by a rapid chaotic-to-stable transition in the first few epochs, which together pose challenges and opportunities for the development of more accurate theories of deep learning.

SGD on Neural Networks Learns Functions of Increasing Complexity

TLDR
Key to the work is a new measure of how well one classifier explains the performance of another, based on conditional mutual information, which can be helpful in explaining why SGD-learned classifiers tend to generalize well even in the over-parameterized regime.
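For illustration, a plug-in estimate of a conditional mutual information of this flavor can be computed from paired discrete predictions as sketched below; the exact statistic and conditioning used in the paper may differ, and the function name and example labels are hypothetical.

```python
# Hedged sketch: plug-in estimator of I(F; Y | G) from discrete samples, the kind
# of conditional-mutual-information quantity used to ask how much classifier F
# says about labels Y beyond a simpler classifier G. Not the paper's exact measure.
import numpy as np
from collections import Counter

def conditional_mutual_information(f, y, g):
    """Plug-in estimate (in nats) of I(F; Y | G) from paired discrete samples."""
    n = len(f)
    p_fyg = Counter(zip(f, y, g))
    p_fg = Counter(zip(f, g))
    p_yg = Counter(zip(y, g))
    p_g = Counter(g)
    cmi = 0.0
    for (fi, yi, gi), c in p_fyg.items():
        # I(F;Y|G) = sum_{f,y,g} p(f,y,g) * log[ p(f,y,g) p(g) / (p(f,g) p(y,g)) ]
        cmi += (c / n) * np.log(c * p_g[gi] / (p_fg[(fi, gi)] * p_yg[(yi, gi)]))
    return cmi

# Hypothetical usage: predictions of a deep net (f), true labels (y), and a linear
# classifier's predictions (g) on the same six examples.
print(conditional_mutual_information([0, 1, 1, 0, 1, 0],
                                      [0, 1, 1, 0, 0, 0],
                                      [0, 1, 0, 0, 1, 0]))
```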

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

TLDR
It is shown that deep linear networks exhibit nonlinear learning phenomena similar to those seen in simulations of nonlinear networks, including long plateaus followed by rapid transitions to lower error solutions, and faster convergence from greedy unsupervised pretraining initial conditions than from random initial conditions.

The Early Phase of Neural Network Training

TLDR
It is found that deep networks are not robust to reinitializing with random weights while maintaining signs, and that weight distributions are highly non-independent even after only a few hundred iterations.

Towards Understanding the Spectral Bias of Deep Learning

TLDR
It is proved that the training process of neural networks can be decomposed along different directions defined by the eigenfunctions of the neural tangent kernel, where each direction has its own convergence rate and the rate is determined by the corresponding eigenvalue.
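A worked form of this decomposition, in illustrative notation under a standard fixed-NTK assumption (not copied from the cited paper):

```latex
% Illustrative derivation; notation ours, not the cited paper's.
Gradient flow on the squared loss with a fixed NTK Gram matrix
$K = \sum_i \lambda_i v_i v_i^\top$ gives residual dynamics
\[
  \dot{r}(t) = -K\, r(t), \qquad r(0) = u(0) - y,
\]
and expanding the residual in the eigenbasis of $K$,
\[
  r(t) = \sum_i \langle v_i, r(0) \rangle\, e^{-\lambda_i t}\, v_i ,
\]
so the error along eigendirection $v_i$ decays at rate $\lambda_i$.
```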

When do neural networks outperform kernel methods?

TLDR
It is shown that this curse of dimensionality becomes milder if the covariates display the same low-dimensional structure as the target function, and a spiked covariates model is presented that can capture in a unified framework both behaviors observed in earlier work.

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

TLDR
Over-parameterization and random initialization jointly restrict every weight vector to remain close to its initialization for all iterations, which enables a strong-convexity-like property to show that gradient descent converges at a global linear rate to a global optimum.

Statistical Optimality of Stochastic Gradient Descent on Hard Learning Problems through Multiple Passes

TLDR
Stochastic gradient descent for least-squares regression with multiple passes is considered, with potentially infinite-dimensional models and notions typically associated with kernel methods, namely the decay of the eigenvalues of the covariance matrix of the features and the complexity of the optimal predictor as measured through the covariance matrix.

Learning with invariances in random features and kernel models

TLDR
This work characterizes the test error of invariant methods in a high-dimensional regime in which the sample size and number of hidden units scale as polynomials in the dimension, and shows that exploiting invariance in the architecture saves a factor of d in achieving the same test error as unstructured architectures.

Train faster, generalize better: Stability of stochastic gradient descent

We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically stable.