Corpus ID: 220250109

Is SGD a Bayesian sampler? Well, almost

@article{Mingard2021IsSA,
  title={Is SGD a Bayesian sampler? Well, almost},
  author={Chris Mingard and Guillermo Valle P{\'e}rez and Joar Skalse and Ard A. Louis},
  journal={J. Mach. Learn. Res.},
  year={2021},
  volume={22},
  pages={79:1-79:64}
}
Overparameterised deep neural networks (DNNs) are highly expressive and so can, in principle, generate almost any function that fits a training dataset with zero error. The vast majority of these functions will perform poorly on unseen data, and yet in practice DNNs often generalise remarkably well. This success suggests that a trained DNN must have a strong inductive bias towards functions with low generalisation error. Here we empirically investigate this inductive bias by calculating, for a… 
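As a hedged sketch of the comparison the paper sets up (the notation below is assumed, not quoted from the abstract): P_SGD(f|S) denotes the probability that SGD, over random initialisations and minibatch orderings, converges to a function f that fits the training set S with zero error, while P_B(f|S) is the Bayesian posterior over functions with the prior P(f) induced by randomly sampling the network parameters and a 0-1 likelihood on S,

\[
  P_B(f \mid S) \;=\; \frac{P(S \mid f)\,P(f)}{\sum_{f'} P(S \mid f')\,P(f')},
  \qquad
  P(S \mid f) \;=\; \mathbf{1}\!\left[\,f \text{ fits } S \text{ with zero error}\,\right],
\]

and the question in the title is whether P_SGD(f|S) ≈ P_B(f|S) holds empirically.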
Why Flatness Correlates With Generalization For Deep Neural Networks
TLDR
It is argued that local flatness measures correlate with generalization because they are local approximations to a global property, the volume of the set of parameters mapping to a specific function, equivalent to the Bayesian prior upon initialization.
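A minimal sketch of the global quantity referred to above (notation assumed here): with M(θ) the function implemented by parameters θ and p_init the distribution over parameters at initialisation,

\[
  V(f) \;=\; \int \mathbf{1}\!\left[\,\mathcal{M}(\theta) = f\,\right]\, p_{\mathrm{init}}(\theta)\, d\theta \;=\; P(f),
\]

so the parameter-space volume of a function equals its prior probability at initialisation, and local flatness measures are read as local estimates of this volume.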
Generalization bounds for deep learning
TLDR
Desiderata for techniques that predict generalization errors for deep learning models in supervised learning are introduced, and a marginal-likelihood PAC-Bayesian bound is derived that fulfills desiderata 1-3 and 5.
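For context, one way such a bound is typically stated (a sketch under assumptions; the exact constants are not taken from the paper): in the realizable setting, choosing the PAC-Bayes posterior Q to be the prior P restricted to hypotheses with zero training error gives KL(Q‖P) = -ln P(S), the negative log marginal likelihood, so a Langford-Seeger-style bound with probability at least 1-δ over training sets S of size m reads

\[
  \mathrm{KL}\!\left(\hat{\varepsilon}(Q)\,\big\|\,\varepsilon(Q)\right)
  \;\le\;
  \frac{-\ln P(S) + \ln\!\frac{2\sqrt{m}}{\delta}}{m},
\]

with empirical error \hat{\varepsilon}(Q) = 0 for interpolating posteriors.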
Investigating Generalization by Controlling Normalized Margin
TLDR
It is shown both that networks can be produced where normalized margin has seemingly no relationship with generalization, counter to the theory of Bartlett et al. (2017), and that, in a standard training setup, test performance closely tracks normalized margin.
Symmetry and simplicity spontaneously emerge from the algorithmic nature of evolution
TLDR
This algorithmic bias predicts a much higher prevalence of low-complexity (high-symmetry) phenotypes than follows from natural selection alone and also explains patterns observed in protein complexes, RNA secondary structures, and a gene regulatory network.
Optimal learning rate schedules in high-dimensional non-convex optimization problems
TLDR
This work presents the first analytical study of the role of learning rate scheduling in Langevin optimization with a learning rate decaying as η(t) = t^{-β}, and focuses on the high-dimensional inference problem of retrieving a ground truth signal from observations via a noisy channel.
Statistical Inference with Stochastic Gradient Algorithms
TLDR
The theoretical results show that properly tuned stochastic gradient algorithms offer a practical approach to obtaining inferences that are computationally efficient and statistically robust.
Separation of scales and a thermodynamic description of feature learning in some CNNs
TLDR
It is shown that DNN layers couple only through the second moment (kernels) of their activations and pre-activations, which indicates a separation of scales occurring in fully trained over-parameterized deep convolutional neural networks (CNNs).
SGD Through the Lens of Kolmogorov Complexity
TLDR
It is proved that stochastic gradient descent finds a solution that achieves (1 − ε) classification accuracy on the entire dataset, and this work gives the first convergence guarantee for general, underparameterized models.
The Equilibrium Hypothesis: Rethinking implicit regularization in Deep Neural Networks
TLDR
The Equilibrium Hypothesis is introduced and empirically validate, which states that the layers that achieve some balance between forward and backward information loss are the ones with the highest alignment to data labels.
On the Implicit Biases of Architecture & Gradient Descent
TLDR
Based on a careful study of the behaviour of infinite width networks trained by Bayesian inference and finite width networks trained by gradient descent, it is found that while typical networks that fit the training data already generalise fairly well, gradient descent can further improve generalisation by selecting networks with a large margin.
...

References

Showing 1-10 of 118 references
Deep learning generalizes because the parameter-function map is biased towards simple functions
TLDR
This paper argues that the parameter-function map of many DNNs should be exponentially biased towards simple functions, and provides clear evidence for this strong simplicity bias in a model DNN for Boolean functions, as well as in much larger fully connected and convolutional networks applied to CIFAR10 and MNIST.
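The "exponential bias towards simple functions" in this line of work is usually expressed through a simplicity-bias bound of the following form (the constants a, b and the computable complexity proxy K̃ are notation assumed here, not quoted from the abstract):

\[
  P(f) \;\le\; 2^{\,-a\,\tilde{K}(f) + b},
\]

where P(f) is the probability that randomly sampled parameters implement the function f and K̃(f) is a compression-based approximation to its Kolmogorov complexity, so low-complexity functions receive exponentially more prior mass.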
Bayesian Deep Learning and a Probabilistic Perspective of Generalization
TLDR
It is shown that deep ensembles provide an effective mechanism for approximate Bayesian marginalization, and a related approach is proposed that further improves the predictive distribution by marginalizing within basins of attraction, without significant overhead.
A Bayesian Perspective on Generalization and Stochastic Gradient Descent
TLDR
It is proposed that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large, and it is demonstrated that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy.
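The batch-size effect described here is often summarised through an SGD noise scale; the expression below is the commonly quoted approximation and is an assumption about which variant is meant, not a quotation from the paper:

\[
  g \;\approx\; \frac{\epsilon N}{B},
\]

where ε is the learning rate, N the training set size and B the batch size, so that at fixed ε the batch size sets the noise level and the optimum batch size corresponds to an intermediate noise scale.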
The promises and pitfalls of Stochastic Gradient Langevin Dynamics
TLDR
It is shown that SGLDFP gives approximate samples from the posterior distribution, with an accuracy comparable to the Langevin Monte Carlo (LMC) algorithm for a computational cost sublinear in the number of data points.
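For readers unfamiliar with the baseline algorithm, here is a minimal sketch of a plain stochastic gradient Langevin dynamics step in the Welling-Teh form (not the SGLDFP control-variate variant analysed in the paper; all argument names below are assumptions for illustration):

import numpy as np

def sgld_step(theta, grad_log_prior, grad_log_lik_minibatch, N, n, eta, rng):
    """One plain SGLD update (a sketch of the Welling-Teh form, not SGLDFP).

    theta                         : current parameter vector (np.ndarray)
    grad_log_prior(theta)         : gradient of the log prior at theta
    grad_log_lik_minibatch(theta) : summed gradient of the log-likelihood over a
                                    minibatch of size n
    N, n                          : dataset size and minibatch size
    eta                           : step size
    rng                           : np.random.Generator
    """
    # Unbiased estimate of the gradient of the log posterior, rescaled by N/n.
    drift = grad_log_prior(theta) + (N / n) * grad_log_lik_minibatch(theta)
    # Injected Gaussian noise with variance eta.
    noise = np.sqrt(eta) * rng.normal(size=theta.shape)
    return theta + 0.5 * eta * drift + noise

With a decaying step size, iterates of this update approximately sample the posterior; the cited paper studies how accurate such samples are relative to full Langevin Monte Carlo.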
Predicting the outputs of finite networks trained with noisy gradients
TLDR
A DNN training protocol involving noise is introduced whose outcome is mappable to a certain non-Gaussian stochastic process; this mapping predicts the outputs of empirical finite networks with high accuracy, improving upon the accuracy of GP predictions by over an order of magnitude.
SGD Learns Over-parameterized Networks that Provably Generalize on Linearly Separable Data
TLDR
This work proves convergence rates of SGD to a global minimum and provides generalization guarantees for this global minimum that are independent of the network size, and shows that SGD can avoid overfitting despite the high capacity of the model.
Sharp Minima Can Generalize For Deep Nets
TLDR
It is argued that most notions of flatness are problematic for deep models and cannot be directly applied to explain generalization, and when focusing on deep networks with rectifier units, the particular geometry of parameter space induced by the inherent symmetries that these architectures exhibit is exploited.
Bayesian Convolutional Neural Networks with Many Channels are Gaussian Processes
TLDR
This work derives an analogous equivalence for multi-layer convolutional neural networks both with and without pooling layers, and introduces a Monte Carlo method to estimate the GP corresponding to a given neural network architecture, even in cases where the analytic form has too many terms to be computationally feasible.
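A hedged sketch of the Monte Carlo idea mentioned here, under the assumption that network outputs are zero-mean at initialisation (the factory make_random_net and all names below are hypothetical, for illustration only):

import numpy as np

def monte_carlo_nngp_kernel(make_random_net, x1, x2, n_samples=1000, seed=0):
    """Estimate the GP kernel K(x1, x2) for a given architecture by averaging the
    product of scalar outputs of freshly initialised networks (a sketch, assuming
    zero-mean outputs at initialisation)."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_samples):
        net = make_random_net(rng)   # hypothetical factory: returns a newly initialised network
        total += float(net(x1)) * float(net(x2))
    return total / n_samples

The empirical covariance converges to the architecture's GP kernel as the number of samples (and the layer widths) grows, which is useful precisely when the analytic kernel has too many terms to compute.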
Energy–entropy competition and the effectiveness of stochastic gradient descent in machine learning
TLDR
A correspondence between parameter inference and free energy minimisation in statistical physics is derived and it is shown that the stochasticity in the SGD algorithm has a non-trivial correlation structure which systematically biases it towards wide minima.
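The correspondence referred to can be sketched with a standard variational identity (notation assumed here): the Gibbs/Bayesian posterior is the distribution minimising a free energy that trades training energy against entropy,

\[
  F[Q] \;=\; \mathbb{E}_{\theta \sim Q}\!\left[E(\theta)\right] \;-\; T\,S[Q],
  \qquad
  Q^{*}(\theta) \;\propto\; \exp\!\left(-E(\theta)/T\right),
\]

where E(θ) plays the role of a training loss, S[Q] is the entropy of Q and the temperature T sets the competition; the anisotropic SGD noise is argued to tilt this competition towards wide, high-entropy minima.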
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
TLDR
This work investigates the cause of this generalization drop in the large-batch regime and presents numerical evidence supporting the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions, and, as is well known, sharp minima lead to poorer generalization.
...