# Stability & Generalisation of Gradient Descent for Shallow Neural Networks without the Neural Tangent Kernel

```bibtex
@article{Richards2021StabilityG,
  title   = {Stability \& Generalisation of Gradient Descent for Shallow Neural Networks without the Neural Tangent Kernel},
  author  = {Dominic Richards and Ilja Kuzborskij},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2107.12723}
}
```

We revisit on-average algorithmic stability of Gradient Descent (GD) for training overparameterised shallow neural networks and prove new generalisation and excess risk bounds without the Neural Tangent Kernel (NTK) or Polyak-Łojasiewicz (PL) assumptions. In particular, we show oracle-type bounds which reveal that the generalisation and excess risk of GD are controlled by an interpolating network with the shortest GD path from initialisation (in a sense, an interpolating network with the…
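The quantity at the heart of these bounds, the length of the GD path from initialisation, is easy to track numerically. Below is a minimal sketch (the setup, widths, and step sizes are illustrative assumptions, not the paper's code): a width-`m` shallow ReLU network with frozen output weights, trained by full-batch GD, while accumulating the path length.

```python
import numpy as np

# Illustrative sketch (not the paper's code): train a shallow ReLU network
# f(x) = (1/sqrt(m)) * sum_j a_j * relu(w_j . x) by full-batch GD and track
# the GD path length from initialisation, the quantity the oracle-type
# bounds are phrased in terms of.
rng = np.random.default_rng(0)
n, d, m = 20, 5, 200                     # samples, input dim, width (overparameterised)
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = rng.normal(size=n)

W = rng.normal(size=(m, d))              # hidden weights (trained)
a = rng.choice([-1.0, 1.0], size=m)      # output weights (frozen, a common simplification)
W0 = W.copy()

def predict(W):
    return (np.maximum(X @ W.T, 0.0) @ a) / np.sqrt(m)

mse0 = np.mean((predict(W0) - y) ** 2)   # error at initialisation

lr, steps = 0.1, 500
path_len = 0.0
for _ in range(steps):
    residual = predict(W) - y                        # (n,)
    act = (X @ W.T > 0).astype(float)                # ReLU derivative, (n, m)
    # gradient of (1/2n) * ||residual||^2 w.r.t. W
    grad = ((residual[:, None] * act) * a).T @ X / (n * np.sqrt(m))
    W -= lr * grad
    path_len += lr * np.linalg.norm(grad)            # accumulate GD path length

dist_from_init = np.linalg.norm(W - W0)
train_mse = np.mean((predict(W) - y) ** 2)
```

By the triangle inequality the path length always upper-bounds the distance from initialisation; the two are close when the GD trajectory is nearly straight.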


#### One Citation

Nonparametric Regression with Shallow Overparameterized Neural Networks Trained by GD with Early Stopping

- Computer Science, Mathematics
- COLT
- 2021

This work explores the ability of overparameterized shallow neural networks to learn Lipschitz regression functions, with and without label noise, when trained by Gradient Descent, and proposes an early stopping rule that yields optimal rates.

#### References

Showing 1–10 of 32 references

Train faster, generalize better: Stability of stochastic gradient descent

- Computer Science, Mathematics
- ICML
- 2016

We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically…
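The stability notion behind this result can be probed empirically: train with SGD on two datasets that differ in a single example, along a shared sample path, and compare the resulting models. A hypothetical sketch with plain linear least squares (the setup and names are assumptions for illustration, not the paper's experiments):

```python
import numpy as np

# Hypothetical sketch: sensitivity of SGD to replacing one training example.
# Fewer iterations give the perturbation less chance to propagate, which is
# the mechanism behind "train faster, generalize better".
rng = np.random.default_rng(1)
n, d = 50, 10
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

X2, y2 = X.copy(), y.copy()
X2[0] = rng.normal(size=d) / np.sqrt(d)   # replace a single example
y2[0] = rng.normal()

def sgd(Xs, ys, steps, lr=0.1, seed=2):
    g = np.random.default_rng(seed)       # shared sample path across runs
    w = np.zeros(d)
    for _ in range(steps):
        i = g.integers(len(ys))
        w -= lr * (Xs[i] @ w - ys[i]) * Xs[i]   # squared-loss gradient step
    return w

# Parameter divergence between the two runs, after few vs. many iterations
gap_few = np.linalg.norm(sgd(X, y, 50) - sgd(X2, y2, 50))
gap_many = np.linalg.norm(sgd(X, y, 2000) - sgd(X2, y2, 2000))
```

With the shared sample path, the two trajectories only separate on steps that draw the replaced example, so the divergence is driven by how often (and how long ago) that example was visited.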

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

- Computer Science, Mathematics
- ICLR
- 2019

Over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum.

Stability and Generalization of Learning Algorithms that Converge to Global Optima

- Mathematics, Computer Science
- ICML
- 2018

This work derives black-box stability results that depend only on the convergence of a learning algorithm and the geometry around the minimizers of the loss function, establishing novel generalization bounds for learning algorithms that converge to global minima.

Beyond Linearization: On Quadratic and Higher-Order Approximation of Wide Neural Networks

- Computer Science, Mathematics
- ICLR
- 2020

This paper investigates the training of over-parametrized neural networks that are beyond the NTK regime yet still governed by the Taylor expansion of the network, and demonstrates that the randomization technique can be generalized systematically beyond the quadratic case.

Data-Dependent Stability of Stochastic Gradient Descent

- Computer Science, Mathematics
- ICML
- 2018

A data-dependent notion of algorithmic stability for Stochastic Gradient Descent is established, and novel generalization bounds are developed that exhibit fast convergence rates for SGD subject to a vanishing empirical risk and low noise of the stochastic gradients.

Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods

- Computer Science, Mathematics
- ICLR
- 2021

It is shown that deep learning can outperform any linear estimator in the sense of the minimax optimal rate, especially in high-dimensional settings, and a so-called fast learning rate is obtained.

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

- Computer Science, Mathematics
- NeurIPS
- 2019

This work shows that for wide neural networks the learning dynamics simplify considerably and that, in the infinite-width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.
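The linearisation claim can be checked numerically at finite width: train a two-layer tanh network by full-batch GD and compare its predictions with those of its first-order Taylor expansion around initialisation. A hypothetical sketch (width, step size, and data are illustrative assumptions); the gap should shrink as the width grows:

```python
import numpy as np

# Hypothetical check: a trained wide network stays close to its linearised
# ("NTK") model f_lin(W) = f(W0) + <J(W0), W - W0>; the deviation decreases
# with width m.
rng = np.random.default_rng(0)
n, d = 10, 3
X = rng.normal(size=(n, d)) / np.sqrt(d)
y = rng.normal(size=n)

def lin_gap(m, steps=200, lr=0.5):
    W0 = rng.normal(size=(m, d))
    a = rng.choice([-1.0, 1.0], size=m)      # frozen output weights

    def f(W):
        return (np.tanh(X @ W.T) @ a) / np.sqrt(m)

    # Jacobian at init: J[i, j, :] = a_j * (1 - tanh^2(w0_j . x_i)) * x_i / sqrt(m)
    J = (a * (1 - np.tanh(X @ W0.T) ** 2))[:, :, None] * X[:, None, :] / np.sqrt(m)

    W = W0.copy()
    for _ in range(steps):                   # full-batch GD on squared loss
        residual = f(W) - y
        sech2 = 1 - np.tanh(X @ W.T) ** 2
        grad = ((residual[:, None] * sech2) * a).T @ X / (n * np.sqrt(m))
        W -= lr * grad

    f_lin = f(W0) + np.einsum('ijk,jk->i', J, W - W0)   # linearised prediction
    return np.max(np.abs(f(W) - f_lin))

gap_narrow, gap_wide = lin_gap(50), lin_gap(5000)
```

The deviation from the linear model scales roughly like 1/√m here, so the wide network tracks its Taylor expansion far more closely than the narrow one.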

Toward Moderate Overparameterization: Global Convergence Guarantees for Training Shallow Neural Networks

- Computer Science, Mathematics
- IEEE Journal on Selected Areas in Information Theory
- 2020

Focusing on shallow neural nets and smooth activations, it is shown that (stochastic) gradient descent, when initialized at random, converges at a geometric rate to a nearby global optimum as soon as the square root of the number of network parameters exceeds the size of the training data.

A Convergence Theory for Deep Learning via Over-Parameterization

- Computer Science, Mathematics
- ICML
- 2019

This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in $\textit{polynomial time}$, and implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.

Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks

- Computer Science, Mathematics
- ICML
- 2019

This paper analyzes training and generalization for a simple two-layer ReLU network with random initialization, and provides the following improvements over recent works: a tighter characterization of training speed, an explanation for why training a neural network with random labels leads to slower training, and a data-dependent complexity measure.