# The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

@inproceedings{Ma2018ThePO, title={The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning}, author={Siyuan Ma and Raef Bassily and Mikhail Belkin}, booktitle={ICML}, year={2018} }

In this paper we aim to formally explain the phenomenon of fast convergence of SGD observed in modern machine learning. [...] Finally, we show how our results fit in the recent developments in training deep neural networks and discuss connections to adaptive rates for SGD and variance reduction.
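
The central setting of the paper is interpolation: the over-parametrized model can drive every per-sample loss to its minimum simultaneously, so the stochastic gradient noise vanishes at the solution and constant step-size SGD converges exponentially fast. A minimal sketch of this setting (the notation $w^*$, $f_i$, $m$, $\eta$ is chosen here for illustration):

```latex
% Interpolation: a single parameter vector w^* minimizes every per-sample loss,
\[
  \exists\, w^* :\quad f_i(w^*) \;=\; \min_w f_i(w) \;\;\text{for all } i = 1,\dots,n
  \quad\Longrightarrow\quad \nabla f_i(w^*) \;=\; 0 .
\]
% Mini-batch SGD with batch size m and constant step size \eta:
\[
  w_{t+1} \;=\; w_t \;-\; \frac{\eta}{m} \sum_{j=1}^{m} \nabla f_{i_j}(w_t) .
\]
% Since every stochastic gradient vanishes at w^*, the gradient noise decays
% together with the loss, and for smooth, strongly convex objectives the
% iterates converge geometrically even though the step size is never decayed.
```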

#### 151 Citations

On exponential convergence of SGD in non-convex over-parametrized learning

- Computer Science, Mathematics
- ArXiv
- 2018

It is argued that the Polyak–Łojasiewicz (PL) condition provides a relevant and attractive setting for many machine learning problems, particularly in the over-parametrized regime.
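
For reference, the PL condition on an objective $f$ with minimum value $f^*$ is the standard inequality below; it yields linear convergence of gradient methods without requiring convexity.

```latex
% Polyak-Lojasiewicz (PL) condition: there exists \mu > 0 such that for all w,
\[
  \tfrac{1}{2}\,\bigl\|\nabla f(w)\bigr\|^2 \;\ge\; \mu\,\bigl(f(w) - f^*\bigr).
\]
```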

Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron

- Computer Science, Mathematics
- AISTATS
- 2019

It is proved that constant step-size stochastic gradient descent (SGD) with Nesterov acceleration matches the convergence rate of the deterministic accelerated method for both convex and strongly-convex functions.
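
For context, one common parameterization of constant step-size SGD with Nesterov momentum (step size $\eta$, momentum $\beta$; the paper's exact formulation may differ) is:

```latex
\[
  y_{t+1} \;=\; w_t \;-\; \eta\, \nabla f_{i_t}(w_t), \qquad
  w_{t+1} \;=\; y_{t+1} \;+\; \beta\,\bigl(y_{t+1} - y_t\bigr),
\]
% where f_{i_t} is the loss on the sample (or mini-batch) drawn at step t.
```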

Last iterate convergence of SGD for Least-Squares in the Interpolation regime

- Computer Science, Mathematics
- ArXiv
- 2021

This work studies the noiseless model in the fundamental least-squares setup, gives explicit non-asymptotic convergence rates in the over-parameterized setting, and leverages a fine-grained parameterization of the problem to exhibit polynomial rates that can be faster than $O(1/T)$.

SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs

- Computer Science
- ArXiv
- 2021

This paper considers the problem of stochastic convex optimization (SCO) and explores the role of implicit regularization, batch size, and multiple epochs for SGD; it extends the results to the general learning setting by exhibiting a problem that is learnable for any data distribution and on which SGD is strictly better than RERM for any regularization function.

Accelerating SGD with momentum for over-parameterized learning

- Computer Science
- ICLR
- 2020

MaSS is introduced, and it is proved that MaSS achieves an accelerated convergence rate over SGD for any mini-batch size in the linear setting; the practically important question of how the convergence rate and optimal hyper-parameters depend on the mini-batch size is also analyzed.

Extrapolation for Large-batch Training in Deep Learning

- Computer Science, Mathematics
- ICML
- 2020

This work proposes to use computationally efficient extrapolation (extragradient) to stabilize the optimization trajectory while still benefiting from smoothing to avoid sharp minima; it proves the convergence of this scheme and rigorously evaluates its empirical performance on ResNet, LSTM, and Transformer models.
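
The extragradient idea referenced here is to take a lookahead (extrapolation) step and then update the original iterate using the gradient evaluated at the lookahead point. A minimal illustrative sketch, not the paper's implementation (`grad_fn`, `lr`, and the toy quadratic are placeholders):

```python
import numpy as np

def extragradient_step(w, grad_fn, lr):
    """One extragradient update: lookahead step, then update the original
    iterate with the gradient computed at the lookahead point."""
    w_lookahead = w - lr * grad_fn(w)       # extrapolation: w_{t+1/2}
    return w - lr * grad_fn(w_lookahead)    # real step uses the gradient at w_{t+1/2}

# Usage sketch on a toy quadratic loss 0.5 * ||A w - b||^2.
A = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([1.0, -1.0])
grad_fn = lambda w: A.T @ (A @ w - b)
w = np.zeros(2)
for _ in range(100):
    w = extragradient_step(w, grad_fn, lr=0.1)
```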

Training Neural Networks for and by Interpolation

- Computer Science, Mathematics
- ICML
- 2020

The majority of modern deep learning models are able to interpolate the data: the empirical loss can be driven near zero on all samples simultaneously, and this property is exploited in the design of a new optimization algorithm for deep learning.
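
One classical way to exploit an attainable near-zero loss is a Polyak-style step size with a zero target value, which adapts the step to the current loss. This is the textbook rule shown only to illustrate the idea, not necessarily the paper's exact algorithm (`max_lr` and the capping are assumptions):

```python
import numpy as np

def polyak_interpolation_step(w, loss_fn, grad_fn, max_lr=1.0, eps=1e-12):
    """Polyak step size assuming the optimal loss is (near) zero under
    interpolation: eta_t = min(loss / ||grad||^2, max_lr)."""
    loss = loss_fn(w)
    grad = grad_fn(w)
    eta = min(loss / (np.dot(grad, grad) + eps), max_lr)
    return w - eta * grad
```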

Accelerating Stochastic Training for Over-parametrized Learning

- Mathematics
- 2018

We introduce MaSS (Momentum-added Stochastic Solver), an accelerated SGD method for optimizing over-parametrized models. Our method is simple and efficient to implement and does not require adapting…

A Non-Asymptotic Comparison of SVRG and SGD: Tradeoffs between Compute and Speed

- 2019

Stochastic gradient descent (SGD), which trades off noisy gradient updates for computational efficiency, is the de-facto optimization algorithm for solving large-scale machine learning problems. SGD can…

Overparameterized Nonlinear Optimization with Applications to Neural Nets

- Computer Science
- 2019 13th International conference on Sampling Theory and Applications (SampTA)
- 2019

This talk shows that the solution found by first-order methods, such as gradient descent, is nearly the closest to the algorithm's initialization among all solutions, and advocates that this shortest-distance property can be a good proxy for the simplest explanation.

#### References

Showing 1-10 of 45 references

To understand deep learning we need to understand kernel learning

- Computer Science, Mathematics
- ICML
- 2018

It is argued that progress on understanding deep learning will be difficult until more tractable "shallow" kernel methods are better understood, and that new theoretical ideas are needed for understanding the properties of classical kernel methods.

Diving into the shallows: a computational perspective on large-scale shallow learning

- Computer Science, Mathematics
- NIPS
- 2017

EigenPro iteration is introduced, based on a preconditioning scheme that uses a small number of approximately computed eigenvectors; injecting this small (computationally inexpensive and SGD-compatible) amount of approximate second-order information leads to major improvements in convergence.
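
An illustrative sketch of the kind of preconditioning EigenPro uses: damp the top eigendirections of the empirical covariance so that a much larger step size becomes stable. This is a simplified linear-model version with exact eigenvectors; the actual method works with kernels and approximately computed eigenvectors.

```python
import numpy as np

def eigenpro_style_preconditioner(X, k):
    """Build a preconditioner that shrinks the k largest eigendirections of
    the covariance down to the (k+1)-th eigenvalue (simplified sketch)."""
    n, d = X.shape
    cov = X.T @ X / n
    eigvals, eigvecs = np.linalg.eigh(cov)      # ascending eigenvalues
    tail = eigvals[-(k + 1)]                    # (k+1)-th largest eigenvalue
    P = np.eye(d)
    for i in range(d - k, d):                   # indices of the k largest
        lam, v = eigvals[i], eigvecs[:, i]
        P -= (1.0 - tail / lam) * np.outer(v, v)
    return P

def preconditioned_sgd_step(w, x, y, P, lr):
    """One preconditioned SGD step on squared loss for a linear model."""
    grad = (x @ w - y) * x                      # per-sample gradient
    return w - lr * (P @ grad)
```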

Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

- Computer Science, Mathematics
- ICLR
- 2017

This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape and compares favorably to state-of-the-art techniques in terms of generalization error and training time.

Empirical Analysis of the Hessian of Over-Parametrized Neural Networks

- Computer Science, Mathematics
- ICLR
- 2018

A case is made that links two observations: small-batch and large-batch gradient descent appear to converge to different basins of attraction, but these are in fact connected through a flat region and so belong to the same basin.

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

- Computer Science
- ArXiv
- 2017

This paper empirically shows that on the ImageNet dataset large minibatches cause optimization difficulties, but that when these are addressed the trained networks exhibit good generalization, enabling the training of visual recognition models on internet-scale data with high efficiency.
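
The recipe in that paper combines a linear learning-rate scaling rule with a gradual warmup; a minimal sketch of such a schedule (the constants below are placeholders, not the paper's exact values):

```python
def scaled_lr_with_warmup(step, steps_per_epoch,
                          base_lr=0.1, base_batch=256, batch_size=8192,
                          warmup_epochs=5):
    """Linear scaling rule with gradual warmup: the target learning rate is
    base_lr * (batch_size / base_batch), reached by ramping up linearly over
    the first warmup_epochs, then held (decay milestones omitted here)."""
    target_lr = base_lr * batch_size / base_batch
    warmup_steps = warmup_epochs * steps_per_epoch
    if step < warmup_steps:
        return base_lr + (step / warmup_steps) * (target_lr - base_lr)
    return target_lr
```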

Don't Decay the Learning Rate, Increase the Batch Size

- Computer Science, Mathematics
- ICLR
- 2018

This procedure is successful for stochastic gradient descent, SGD with momentum, Nesterov momentum, and Adam, and reaches equivalent test accuracies after the same number of training epochs but with fewer parameter updates, leading to greater parallelism and shorter training times.
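
A minimal sketch of the schedule this entry describes, with illustrative milestones and growth factor (not the paper's exact values): keep the learning rate fixed and grow the batch size where one would normally decay the rate.

```python
def lr_and_batch_size(epoch, base_lr=0.1, base_batch=128,
                      milestones=(60, 120, 160), factor=5):
    """Instead of dividing the learning rate by `factor` at each milestone,
    multiply the batch size by `factor` and leave the learning rate alone."""
    k = sum(epoch >= m for m in milestones)
    return base_lr, base_batch * factor ** k
```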

Train longer, generalize better: closing the generalization gap in large batch training of neural networks

- Mathematics, Computer Science
- NIPS
- 2017

This work proposes a "random walk on random landscape" statistical model, which is known to exhibit similar "ultra-slow" diffusion behavior, and presents a novel algorithm named "Ghost Batch Normalization" which enables a significant decrease in the generalization gap without increasing the number of updates.
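
Ghost Batch Normalization computes normalization statistics over small virtual ("ghost") sub-batches of a large batch rather than over the full batch. A stripped-down sketch without the learned scale/shift parameters or running statistics:

```python
import numpy as np

def ghost_batch_norm(x, ghost_size, eps=1e-5):
    """Normalize each ghost sub-batch with its own mean/variance, so large-batch
    training sees the same normalization noise as small-batch training."""
    out = np.empty_like(x)
    for start in range(0, x.shape[0], ghost_size):
        chunk = x[start:start + ghost_size]
        mean = chunk.mean(axis=0)
        var = chunk.var(axis=0)
        out[start:start + ghost_size] = (chunk - mean) / np.sqrt(var + eps)
    return out
```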

Scaling SGD Batch Size to 32K for ImageNet Training

- Computer Science
- ArXiv
- 2017

Layer-wise Adaptive Rate Scaling (LARS) is proposed, a method that enables large-batch training for general networks and datasets; it can scale the batch size to 32768 for ResNet-50 and 8192 for AlexNet.
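
The core of LARS is a layer-wise "trust ratio" that rescales the update by the ratio of the weight norm to the gradient norm. A single-layer sketch without momentum (the trust coefficient and weight decay values are placeholders):

```python
import numpy as np

def lars_update(w, grad, global_lr, trust_coef=0.001, weight_decay=5e-4, eps=1e-9):
    """One LARS-style update for a single layer: the local learning rate is
    proportional to ||w|| / (||grad|| + weight_decay * ||w||), so layers whose
    gradients are small relative to their weights still make progress."""
    w_norm = np.linalg.norm(w)
    g_norm = np.linalg.norm(grad)
    local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm + eps)
    return w - global_lr * local_lr * (grad + weight_decay * w)
```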

Optimization Methods for Large-Scale Machine Learning

- Computer Science, Mathematics
- SIAM Rev.
- 2018

A major theme of this study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter, leading to a discussion about the next generation of optimization methods for large-scale machine learning.

An Analysis of Deep Neural Network Models for Practical Applications

- Computer Science
- ArXiv
- 2016

This work presents a comprehensive analysis of important metrics in practical applications: accuracy, memory footprint, parameters, operation count, inference time, and power consumption, and argues that it provides a compelling set of information to help design and engineer efficient DNNs.