Corpus ID: 238215172

Stochastic Training is Not Necessary for Generalization

@article{Geiping2021StochasticTI,
  title={Stochastic Training is Not Necessary for Generalization},
  author={Jonas Geiping and Micah Goldblum and Phillip E. Pope and Michael Moeller and Tom Goldstein},
  journal={ArXiv},
  year={2021},
  volume={abs/2109.14119}
}
It is widely believed that the implicit regularization of stochastic gradient descent (SGD) is fundamental to the impressive generalization behavior we observe in neural networks. In this work, we demonstrate that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on par with SGD, using modern architectures in settings with and without data augmentation. To this end, we utilize modified hyperparameters and show that the implicit regularization of SGD can be…
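To make the contrast concrete, the following is a minimal sketch, not the authors' released code, of how a single deterministic full-batch update differs from a mini-batch SGD epoch in a PyTorch-style loop; `model`, `dataset`, `loss_fn`, and the batch sizes are placeholders.

```python
# Minimal sketch (assumed PyTorch setup; `model`, `dataset`, `loss_fn` are placeholders).
from torch.utils.data import DataLoader

def full_batch_step(model, dataset, loss_fn, optimizer, chunk_size=128):
    """One non-stochastic update: accumulate the exact full-dataset gradient, then step once."""
    loader = DataLoader(dataset, batch_size=chunk_size, shuffle=False)
    n = len(dataset)
    optimizer.zero_grad()
    for x, y in loader:  # chunking is only for memory, not a source of stochasticity
        (loss_fn(model(x), y) * x.size(0) / n).backward()  # gradients sum to the full-batch gradient
    optimizer.step()     # a single deterministic update per pass over the data

def sgd_epoch(model, dataset, loss_fn, optimizer, batch_size=128):
    """One SGD epoch: a separate noisy update for each randomly sampled mini-batch."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    for x, y in loader:
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
```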


On the Implicit Biases of Architecture & Gradient Descent
TLDR: It is found that while typical networks that fit the training data already generalise fairly well, gradient descent can further improve generalisation by selecting networks with a large margin, based on a careful study of the behaviour of infinite-width networks trained by Bayesian inference and finite-width networks trained by gradient descent.

References

Showing 1-10 of 74 references
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
TLDR: This work investigates the cause of this generalization drop in the large-batch regime and presents numerical evidence supporting the view that large-batch methods tend to converge to sharp minimizers of the training and testing functions, and, as is well known, sharp minima lead to poorer generalization.
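"Sharpness" here is commonly proxied by the curvature of the loss at the minimum. The sketch below estimates the largest Hessian eigenvalue by power iteration on Hessian-vector products; this is a standard curvature diagnostic and an assumption on my part, not the exact sharpness metric used in the cited paper.

```python
# Hedged sketch: largest Hessian eigenvalue of a scalar `loss` w.r.t. `params`,
# estimated by power iteration on Hessian-vector products (a common "sharpness" proxy).
import torch

def top_hessian_eigenvalue(loss, params, iters=20):
    grads = torch.autograd.grad(loss, params, create_graph=True)  # keep graph for H·v products
    v = [torch.randn_like(p) for p in params]
    for _ in range(iters):
        norm = torch.sqrt(sum((vi ** 2).sum() for vi in v))
        v = [vi / norm for vi in v]
        hv = torch.autograd.grad(grads, params, grad_outputs=v, retain_graph=True)  # H·v
        eig = sum((hvi * vi).sum() for hvi, vi in zip(hv, v))  # Rayleigh quotient, ||v|| = 1
        v = [hvi.detach() for hvi in hv]
    return eig.item()
```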
On the Generalization Benefit of Noise in Stochastic Gradient Descent
TLDR: This paper performs carefully designed experiments and rigorous hyperparameter sweeps on a range of popular models, which verify that small or moderately large batch sizes can substantially outperform very large batches on the test set.
Towards Theoretical Understanding of Large Batch Training in Stochastic Gradient Descent
TLDR: It is proved that SGD tends to converge to flatter minima in the asymptotic regime (although it may take exponential time to converge) regardless of the batch size, and that SGD with a larger ratio of learning rate to batch size tends to converge to a flat minimum faster; however, its generalization performance could be worse.
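The learning-rate-to-batch-size ratio is often summarized through the SGD noise-scale approximation g ≈ ηN/B (learning rate η, dataset size N, batch size B). The arithmetic below is purely illustrative; the numbers are not taken from any of the cited papers.

```python
# Illustrative arithmetic only (values are made up): the SGD noise-scale approximation
# g ≈ lr * N / B, governed by the ratio of learning rate to batch size.
def noise_scale(lr, dataset_size, batch_size):
    return lr * dataset_size / batch_size

N = 50_000                            # e.g. the CIFAR-10 training-set size
print(noise_scale(0.1, N, 128))       # small batch: ~39.1
print(noise_scale(0.1, N, 8192))      # large batch, same lr: ~0.6, far less gradient noise
print(noise_scale(6.4, N, 8192))      # scaling lr with the batch size restores the ratio: ~39.1
```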
Stochastic Gradient Descent Performs Variational Inference, Converges to Limit Cycles for Deep Networks
TLDR: It is proved that SGD minimizes an average potential over the posterior distribution of weights along with an entropic regularization term, and that the most likely trajectories of SGD for deep networks do not behave like Brownian motion around critical points but resemble closed loops with deterministic components.
The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning
TLDR: The key observation is that most modern learning architectures are over-parametrized and are trained to interpolate the data by driving the empirical loss close to zero, yet it remains unclear why these interpolated solutions perform well on test data.
On the Origin of Implicit Regularization in Stochastic Gradient Descent
TLDR: It is proved that for SGD with random shuffling, the mean SGD iterate also stays close to the path of gradient flow if the learning rate is small and finite, but on a modified loss.
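Backward-error analyses of this kind attach a gradient-norm penalty to the original loss. The sketch below makes such a penalty explicit during training; the coefficient `alpha` and the setup are illustrative assumptions, not values taken from the reference.

```python
# Sketch (assumed PyTorch): training on an explicitly "modified" loss
# L(w) + alpha * ||grad L(w)||^2, i.e. the implicit gradient-norm penalty made explicit.
# `alpha` is an illustrative hyperparameter, not a value from the cited paper.
import torch

def penalized_loss(model, loss_fn, x, y, alpha=0.01):
    params = [p for p in model.parameters() if p.requires_grad]
    loss = loss_fn(model(x), y)
    grads = torch.autograd.grad(loss, params, create_graph=True)  # keep graph so the penalty is differentiable
    grad_sq = sum(g.pow(2).sum() for g in grads)
    return loss + alpha * grad_sq  # .backward() on this also differentiates the penalty term
```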
A Bayesian Perspective on Generalization and Stochastic Gradient Descent
TLDR: It is proposed that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large, and it is demonstrated that, when the learning rate is held fixed, there is an optimum batch size which maximizes test-set accuracy.
Sharp Minima Can Generalize For Deep Nets
TLDR: It is argued that most notions of flatness are problematic for deep models and cannot be directly applied to explain generalization; when focusing on deep networks with rectifier units, the particular geometry of parameter space induced by the inherent symmetries these architectures exhibit is exploited.
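One such symmetry of rectifier networks is layer-wise rescaling: multiplying one layer's weights by α > 0 and dividing the next layer's by α leaves the function unchanged while reshaping the loss surface, so flatness is not reparameterization-invariant. Below is a small numerical check; the toy two-layer network and the value of α are illustrative.

```python
# Numerical check of the ReLU rescaling symmetry: scaling one layer by alpha and the next
# by 1/alpha leaves the outputs unchanged, while the parameter scale (and curvature) changes.
import torch

torch.manual_seed(0)
w1, w2 = torch.randn(16, 8), torch.randn(4, 16)
x = torch.randn(32, 8)

def net(x, w1, w2):
    return torch.relu(x @ w1.T) @ w2.T  # positive homogeneity: relu(a*z) = a*relu(z) for a > 0

alpha = 10.0
out_orig = net(x, w1, w2)
out_scaled = net(x, alpha * w1, w2 / alpha)  # same function, very different parameters
print(torch.allclose(out_orig, out_scaled, atol=1e-4))  # True
```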
Train longer, generalize better: closing the generalization gap in large batch training of neural networks
TLDR: This work proposes a "random walk on a random landscape" statistical model, which is known to exhibit similar "ultra-slow" diffusion behavior, and presents a novel algorithm named "Ghost Batch Normalization" which enables a significant decrease in the generalization gap without increasing the number of updates.
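A simplified sketch of the Ghost Batch Normalization idea follows: each small "ghost" chunk of a large batch is normalized with its own statistics, so large-batch training keeps small-batch normalization noise. This is a hedged reimplementation (affine parameters and running statistics are omitted), not the paper's code.

```python
# Simplified Ghost Batch Normalization sketch: per-chunk batch-norm statistics inside a large batch.
# Affine parameters and running statistics are intentionally omitted.
import torch
import torch.nn.functional as F

def ghost_batch_norm(x, ghost_size=32, eps=1e-5):
    chunks = x.split(ghost_size, dim=0)  # virtual small batches inside the large batch
    normed = [
        F.batch_norm(c, running_mean=None, running_var=None, training=True, eps=eps)
        for c in chunks
    ]
    return torch.cat(normed, dim=0)
```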
Large Batch Optimization for Deep Learning: Training BERT in 76 minutes
TLDR: The empirical results demonstrate the superior performance of LAMB across various tasks such as BERT and ResNet-50 training with very little hyperparameter tuning; the optimizer enables the use of very large batch sizes of 32868 without any degradation in performance.
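The core of LAMB is a layer-wise trust ratio that rescales an Adam-style update by ||w|| / ||update||. The sketch below condenses a single step under simplifying assumptions (no bias correction or parameter-exclusion rules); it is not the official implementation.

```python
# Condensed, simplified sketch of one LAMB-style step: Adam-style moments plus a
# layer-wise trust ratio ||w|| / ||r||. Not the official implementation.
import torch

def lamb_step(param, grad, m, v, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, weight_decay=0.01):
    m.mul_(beta1).add_(grad, alpha=1 - beta1)            # first moment
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)  # second moment
    r = m / (v.sqrt() + eps) + weight_decay * param      # Adam-style update direction
    trust = param.norm() / r.norm().clamp(min=1e-12)     # layer-wise trust ratio
    param.add_(r, alpha=-(lr * trust.item()))            # scaled parameter update
```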