Optimisation & Generalisation in Networks of Neurons

  title={Optimisation \& Generalisation in Networks of Neurons},
  author={Jeremy Bernstein},
The goal of this thesis is to develop the optimisation and generalisation theoretic foundations of learning in artificial neural networks. On optimisation, a new theoretical framework is proposed for deriving architecture-dependent first-order optimisation algorithms. The approach works by combining a "functional majorisation" of the loss function with "architectural perturbation bounds" that encode an explicit dependence on neural architecture. The framework yields optimisation methods that…



Deep Neural Networks as Gaussian Processes

The exact equivalence between infinitely wide deep networks and GPs is derived and it is found that test performance increases as finite-width trained networks are made wider and more similar to a GP, and thus that GP predictions typically outperform those of finite-width networks.
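
The infinite-width correspondence gives a closed-form kernel recursion over depth. Below is a minimal sketch for fully connected ReLU networks using the arc-cosine expectation of Cho & Saul; the function name and the choice σ_w² = 2, σ_b² = 0 (which preserves the self-kernel layer to layer) are illustrative assumptions, not taken from this paper.

```python
import numpy as np

def nngp_relu_kernel(x1, x2, depth=3, sigma_w2=2.0, sigma_b2=0.0):
    """NNGP kernel recursion for a fully connected ReLU network.

    Each layer maps the previous covariances (k11, k22, k12) through the
    arc-cosine expectation E[relu(u) relu(v)] for (u, v) jointly Gaussian.
    """
    d = len(x1)
    k11 = sigma_b2 + sigma_w2 * np.dot(x1, x1) / d
    k22 = sigma_b2 + sigma_w2 * np.dot(x2, x2) / d
    k12 = sigma_b2 + sigma_w2 * np.dot(x1, x2) / d
    for _ in range(depth):
        theta = np.arccos(np.clip(k12 / np.sqrt(k11 * k22), -1.0, 1.0))
        k12 = sigma_b2 + sigma_w2 / (2 * np.pi) * np.sqrt(k11 * k22) * (
            np.sin(theta) + (np.pi - theta) * np.cos(theta))
        # E[relu(u)^2] = k / 2 for u ~ N(0, k)
        k11 = sigma_b2 + sigma_w2 * k11 / 2
        k22 = sigma_b2 + sigma_w2 * k22 / 2
    return k12

x = np.ones(4) * 0.5
print(nngp_relu_kernel(x, x))  # → 0.5 (self-kernel preserved at sigma_w2 = 2)
```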

Learning by Turning: Neural Architecture Aware Optimisation

A combined study of neural architecture and optimisation is conducted, leading to a new optimiser called Nero: the neuronal rotator, which trains reliably without momentum or weight decay, works in situations where Adam and SGD fail, and requires little to no learning rate tuning.
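
As a rough illustration only: my recollection is that Nero maintains per-neuron constraints, keeping each neuron's incoming weight vector zero-mean and unit-norm, but the exact update rule is in the paper and is not reproduced here. A hypothetical projection step onto such a constraint set might look like:

```python
import numpy as np

def project_neuron(w):
    """Project one neuron's incoming weight vector onto the zero-mean,
    unit-norm set. This is only a sketch of the per-neuron constraint
    idea; the full Nero optimiser also scales its gradient step per
    neuron, which is omitted here."""
    w = w - w.mean()
    return w / (np.linalg.norm(w) + 1e-12)

w = project_neuron(np.array([3.0, 1.0, 2.0]))
print(w.mean(), np.linalg.norm(w))  # ~0 and 1
```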

Adam: A Method for Stochastic Optimization

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
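
The update rule introduced in the paper is short enough to sketch directly: exponential moving averages of the gradient and its square, bias-corrected, then a coordinate-wise rescaled step. The toy run below minimises f(θ) = θ²; the learning rate is raised to 0.01 (from the paper's default 0.001) purely so the toy converges quickly.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update with bias-corrected first and second moment estimates."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)  # bias-corrected first moment
    v_hat = v / (1 - beta2**t)  # bias-corrected second moment
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimise f(theta) = theta^2 starting from theta = 5.
theta = np.array([5.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
for t in range(1, 5001):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t, lr=0.01)
print(float(theta[0]))
```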

PAC-Bayesian Generalisation Error Bounds for Gaussian Process Classification

  • M. Seeger
  • Computer Science
    J. Mach. Learn. Res.
  • 2002
By applying the PAC-Bayesian theorem of McAllester (1999a), this paper proves distribution-free generalisation error bounds for a wide range of approximate Bayesian GP classification techniques, giving a strong learning-theoretical justification for the use of these techniques.
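
The McAllester-style bound underlying this result states, up to the exact constants (which vary between versions of the theorem), that for any prior P over classifiers and any δ > 0, with probability at least 1 − δ over an i.i.d. sample of size m, simultaneously for all posteriors Q:

```latex
\operatorname{err}(Q) \;\le\; \widehat{\operatorname{err}}(Q)
  \;+\; \sqrt{\frac{\mathrm{KL}(Q \,\|\, P) + \ln\frac{m}{\delta} + 2}{2m - 1}}
```

Here err(Q) is the expected error of the Gibbs classifier drawn from Q and the KL term is the price paid for choosing Q after seeing the data.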

Generalization bounds for deep learning

Desiderata for techniques that predict generalization errors for deep learning models in supervised learning are introduced, and a marginal-likelihood PAC-Bayesian bound is derived that fulfills desiderata 1-3 and 5.

Gradient Descent on Neural Networks Typically Occurs at the Edge of Stability

It is empirically demonstrated that full-batch gradient descent on neural network training objectives typically operates in a regime the authors call the Edge of Stability, which is inconsistent with several widespread presumptions in the field of optimization.
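
The "Edge of Stability" name refers to the classical stability threshold of gradient descent: on a quadratic with curvature λ, full-batch GD converges only when η·λ < 2. The numpy snippet below illustrates that threshold on a one-dimensional quadratic; it is not a reproduction of the paper's experiments, which use real networks.

```python
import numpy as np

def gd_trajectory(lam, lr, steps=50, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * lam * x^2.

    The iterate is multiplied by (1 - lr * lam) each step, so it
    converges iff |1 - lr * lam| < 1, i.e. iff lr * lam < 2.
    """
    x = x0
    for _ in range(steps):
        x -= lr * lam * x  # gradient of f is lam * x
    return x

lam = 4.0
stable = gd_trajectory(lam, lr=0.4)    # lr * lam = 1.6 < 2: converges
unstable = gd_trajectory(lam, lr=0.6)  # lr * lam = 2.4 > 2: diverges
print(abs(stable), abs(unstable))
```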

Understanding deep learning requires rethinking generalization

These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.

Revisiting Natural Gradient for Deep Networks

This work describes how unlabeled data can be used to improve the generalization error obtained by natural gradient, and empirically evaluates the algorithm's robustness to the ordering of the training set compared with stochastic gradient descent.

Gaussian Processes and Kernel Methods: A Review on Connections and Equivalences

This paper is an attempt to bridge the conceptual gaps between researchers working on the two widely used approaches based on positive definite kernels: Bayesian learning or inference using Gaussian processes, and frequentist kernel methods.

Natural Gradient Works Efficiently in Learning

  • S. Amari
  • Computer Science
    Neural Computation
  • 1998
The dynamical behavior of natural gradient online learning is analyzed and is proved to be Fisher efficient, implying that it has asymptotically the same performance as the optimal batch estimation of parameters.
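
The natural gradient preconditions the ordinary gradient by the inverse Fisher information, θ ← θ − η F⁻¹∇L. The toy below stands in for F with the exact curvature matrix of an ill-conditioned quadratic (an assumption for illustration; in practice F is the Fisher information estimated from the model), in which case a single unit-step natural-gradient update lands exactly at the optimum where plain gradient descent would zig-zag.

```python
import numpy as np

A = np.diag([1.0, 100.0])       # ill-conditioned curvature, stand-in for Fisher F
theta = np.array([1.0, 1.0])    # loss is 0.5 * theta^T A theta

grad = A @ theta                     # ordinary gradient
nat_grad = np.linalg.solve(A, grad)  # F^{-1} grad, without forming the inverse
theta = theta - 1.0 * nat_grad       # one natural-gradient step with eta = 1
print(theta)                         # → [0. 0.], the exact optimum
```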