Corpus ID: 236447687

Stability & Generalisation of Gradient Descent for Shallow Neural Networks without the Neural Tangent Kernel

Dominic Richards, Ilja Kuzborskij
We revisit on-average algorithmic stability of Gradient Descent (GD) for training overparameterised shallow neural networks and prove new generalisation and excess risk bounds without the Neural Tangent Kernel (NTK) or Polyak-Łojasiewicz (PL) assumptions. In particular, we show oracle type bounds which reveal that the generalisation and excess risk of GD are controlled by an interpolating network with the shortest GD path from initialisation (in a sense, an interpolating network with the…

Nonparametric Regression with Shallow Overparameterized Neural Networks Trained by GD with Early Stopping
This work explores the ability of overparameterized shallow neural networks to learn Lipschitz regression functions with and without label noise when trained by Gradient Descent and proposes an early stopping rule that allows them to show optimal rates.


Train faster, generalize better: Stability of stochastic gradient descent
We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically…
Gradient Descent Provably Optimizes Over-parameterized Neural Networks
Over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum.
Stability and Generalization of Learning Algorithms that Converge to Global Optima
This work derives black-box stability results that depend only on the convergence of a learning algorithm and the geometry around the minimizers of the loss function, and uses them to establish novel generalization bounds for learning algorithms that converge to global minima.
Beyond Linearization: On Quadratic and Higher-Order Approximation of Wide Neural Networks
This paper investigates the training of over-parametrized neural networks that are beyond the NTK regime yet still governed by the Taylor expansion of the network, and demonstrates that the randomization technique can be generalized systematically beyond the quadratic case.
Data-Dependent Stability of Stochastic Gradient Descent
A data-dependent notion of algorithmic stability for Stochastic Gradient Descent is established, and novel generalization bounds are developed that exhibit fast convergence rates for SGD subject to a vanishing empirical risk and low noise in the stochastic gradient.
Benefit of deep learning with non-convex noisy gradient descent: Provable excess risk bound and superiority to kernel methods
It is shown that deep learning can outperform any linear estimator in the sense of the minimax optimal rate, especially in high-dimensional settings, and that a so-called fast learning rate is obtained.
Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
This work shows that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.
Toward Moderate Overparameterization: Global Convergence Guarantees for Training Shallow Neural Networks
Focusing on shallow neural nets and smooth activations, it is shown that (stochastic) gradient descent, when initialized at random, converges at a geometric rate to a nearby global optimum as soon as the square root of the number of network parameters exceeds the size of the training data.
A Convergence Theory for Deep Learning via Over-Parameterization
This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in polynomial time and implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting.
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: a tighter characterization of training speed, an explanation for why training a neural net with random labels leads to slower training, and a data-dependent complexity measure.