Corpus ID: 32711926

The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning

@inproceedings{Ma2018ThePO,
  title={The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning},
  author={Siyuan Ma and Raef Bassily and Mikhail Belkin},
  booktitle={ICML},
  year={2018}
}
In this paper we aim to formally explain the phenomenon of fast convergence of SGD observed in modern machine learning. [...] Finally, we show how our results fit in the recent developments in training deep neural networks and discuss connections to adaptive rates for SGD and variance reduction.
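
As a concrete illustration of the interpolation setting in the title, here is a minimal sketch (an assumed toy setup, not the paper's code): constant-step-size mini-batch SGD on a noiseless, over-parametrized least-squares problem where the training loss can be driven to zero. The dimensions, step size, and mini-batch size below are illustrative choices.

import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 500                      # fewer samples than parameters: the interpolation regime
X = rng.standard_normal((n, d))
w_star = rng.standard_normal(d)
y = X @ w_star                       # noiseless labels, so zero training loss is attainable

w = np.zeros(d)
eta, m = 1e-3, 10                    # constant step size and mini-batch size (illustrative)
for t in range(20_000):
    idx = rng.choice(n, size=m, replace=False)
    grad = X[idx].T @ (X[idx] @ w - y[idx]) / m   # mini-batch gradient of 0.5 * MSE
    w -= eta * grad

print("final training MSE:", np.mean((X @ w - y) ** 2))  # decays toward zero at a geometric rate
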
Citations

On exponential convergence of SGD in non-convex over-parametrized learning
TLDR: It is argued that the PL condition provides a relevant and attractive setting for many machine learning problems, particularly in the over-parametrized regime.
Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron
TLDR: It is proved that constant step-size stochastic gradient descent (SGD) with Nesterov acceleration matches the convergence rate of the deterministic accelerated method for both convex and strongly-convex functions.
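
As an illustration of the update analyzed there, the snippet below applies one common formulation of Nesterov momentum to a stochastic gradient at a constant step size. The momentum value, step size, and the gradient oracle grad_fn are placeholders, and this is a sketch of the standard update, not that paper's code.

def nesterov_sgd_step(w, v, grad_fn, eta=1e-3, mu=0.9):
    """One constant-step Nesterov-momentum SGD update (Sutskever-style formulation).
    grad_fn(u) should draw a fresh mini-batch and return its gradient at u."""
    g = grad_fn(w + mu * v)      # gradient at the look-ahead point
    v = mu * v - eta * g         # update the velocity
    return w + v, v              # update the parameters

In the toy interpolation setup sketched earlier, grad_fn would simply be the mini-batch least-squares gradient.
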
Last iterate convergence of SGD for Least-Squares in the Interpolation regime
TLDR: This work studies the noiseless model in the fundamental least-squares setup, gives explicit non-asymptotic convergence rates in the over-parameterized setting, and leverages a fine-grained parameterization of the problem to exhibit polynomial rates that can be faster than $O(1/T)$.
SGD: The Role of Implicit Regularization, Batch-size and Multiple-epochs
TLDR: This paper considers the problem of stochastic convex optimization (SCO) and explores the role of implicit regularization, batch size, and multiple epochs for SGD; it extends the results to the general learning setting by exhibiting a problem that is learnable for any data distribution and on which SGD is strictly better than regularized ERM (RERM) for any regularization function.
Accelerating SGD with momentum for over-parameterized learning
TLDR: MaSS is introduced, and it is proved that MaSS obtains accelerated convergence rates over SGD for any mini-batch size in the linear setting; the practically important question of how the convergence rate and optimal hyper-parameters depend on the mini-batch size is also analyzed.
Extrapolation for Large-batch Training in Deep Learning
TLDR: This work proposes to use computationally efficient extrapolation (extragradient) to stabilize the optimization trajectory while still benefiting from smoothing to avoid sharp minima, proves the convergence of this novel scheme, and rigorously evaluates its empirical performance on ResNet, LSTM, and Transformer.
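
The extrapolation referred to is, at its core, the classical extragradient step: take a provisional gradient step, then update from the original iterate using the gradient evaluated at the provisional point. Below is a hedged sketch of that recursion; the step size, and whether the two gradient evaluations reuse the same mini-batch, are assumptions rather than details taken from the paper.

def extragradient_step(w, grad_fn, eta=1e-3):
    """One extragradient update: extrapolate, then correct from the original iterate."""
    w_half = w - eta * grad_fn(w)        # provisional (look-ahead) step
    return w - eta * grad_fn(w_half)     # actual update uses the look-ahead gradient
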
Training Neural Networks for and by Interpolation
TLDR: The majority of modern deep learning models are able to interpolate the data: the empirical loss can be driven near zero on all samples simultaneously, and this property is exploited for the design of a new optimization algorithm for deep learning.
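
A near-zero attainable loss is exactly what makes a Polyak-style step size usable: if the best achievable loss is roughly zero, the current loss value itself indicates how far to step. The sketch below shows such a clipped Polyak step for a mini-batch; it illustrates the idea only, is not necessarily that paper's exact update, and the maximal step size and epsilon are placeholders.

def polyak_interpolation_step(w, loss, grad, eta_max=0.1, eps=1e-8):
    """Step size = batch loss / ||grad||^2 (assuming the optimal loss is ~0), clipped at eta_max."""
    step = min(loss / (float(grad @ grad) + eps), eta_max)
    return w - step * grad
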
Accelerating Stochastic Training for Over-parametrized Learning
We introduce MaSS (Momentum-added Stochastic Solver), an accelerated SGD method for optimizing over-parametrized models. Our method is simple and efficient to implement and does not require adapting [...]
A Non-Asymptotic Comparison of SVRG and SGD: Tradeoffs Between Compute and Speed
Stochastic gradient descent (SGD), which trades off noisy gradient updates for computational efficiency, is the de-facto optimization algorithm for solving large-scale machine learning problems. SGD can [...]
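
For contrast with plain SGD, the variance-reduced recursion that SVRG uses can be sketched as below. This is the standard SVRG outer iteration, not that paper's code; the per-sample gradient oracle grad_i, the step size, and the inner-loop length are placeholders.

import numpy as np

def svrg_epoch(w_tilde, grad_i, n, eta=1e-3, inner_steps=None, rng=None):
    """One SVRG outer iteration: snapshot the full gradient, then run variance-reduced steps.
    grad_i(w, i) returns the gradient of the i-th sample's loss at w."""
    rng = rng or np.random.default_rng()
    inner_steps = inner_steps or n
    mu = sum(grad_i(w_tilde, i) for i in range(n)) / n        # full gradient at the snapshot
    w = w_tilde.copy()
    for _ in range(inner_steps):
        i = rng.integers(n)
        w -= eta * (grad_i(w, i) - grad_i(w_tilde, i) + mu)   # variance-reduced stochastic step
    return w
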
Overparameterized Nonlinear Optimization with Applications to Neural Nets
Samet Oymak. 2019 13th International Conference on Sampling Theory and Applications (SampTA), 2019.
TLDR: This talk shows that the solution found by first-order methods such as gradient descent has near-shortest distance to the initialization of the algorithm among all solutions, and advocates that this shortest-distance property can be a good proxy for the simplest explanation.

References

Showing 1-10 of 45 references
To understand deep learning we need to understand kernel learning
TLDR: It is argued that progress on understanding deep learning will be difficult until more tractable "shallow" kernel methods are better understood, and that new theoretical ideas are needed for understanding the properties of classical kernel methods.
Diving into the shallows: a computational perspective on large-scale shallow learning
TLDR: EigenPro iteration is introduced, based on a preconditioning scheme that uses a small number of approximately computed eigenvectors; it turns out that injecting this small (computationally inexpensive and SGD-compatible) amount of approximate second-order information leads to major improvements in convergence.
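
One way to realize the preconditioning idea described there is to damp the stochastic gradient along the top few eigendirections of the data covariance, so that a larger constant step size remains stable. The sketch below uses a commonly cited form of such a preconditioner, P = I - sum_{i<=k} (1 - lambda_{k+1}/lambda_i) v_i v_i^T, in a small linear setting; it is an assumed illustration, not the EigenPro library, and the exact damping should be checked against the paper.

import numpy as np

def top_k_preconditioner(X, k):
    """Damp the top-k eigendirections of the empirical covariance of X (rows = samples)."""
    cov = X.T @ X / X.shape[0]
    lam, V = np.linalg.eigh(cov)          # eigenvalues in ascending order
    lam, V = lam[::-1], V[:, ::-1]        # reorder to descending
    P = np.eye(X.shape[1])
    for i in range(k):                    # shrink each top direction toward the (k+1)-th scale
        P -= (1.0 - lam[k] / lam[i]) * np.outer(V[:, i], V[:, i])
    return P

# inside an SGD loop one would then use  w -= eta * (P @ grad)  with a larger eta than plain SGD tolerates
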
Entropy-SGD: Biasing Gradient Descent Into Wide Valleys
TLDR: This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape and compares favorably to state-of-the-art techniques in terms of generalization error and training time.
Empirical Analysis of the Hessian of Over-Parametrized Neural Networks
TLDR: A case is made that links the two observations: small-batch and large-batch gradient descent appear to converge to different basins of attraction, but these are in fact connected through their flat region and so belong to the same basin.
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
TLDR: This paper empirically shows that on the ImageNet dataset large minibatches cause optimization difficulties, but that when these are addressed the trained networks exhibit good generalization, enabling the training of visual recognition models on internet-scale data with high efficiency.
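
The fix used in that line of work is widely described as a linear learning-rate scaling rule combined with a gradual warmup phase. The helper below sketches such a schedule; the reference batch size of 256, the warmup length, and the decay milestones are illustrative assumptions, not values taken from the paper.

def lr_schedule(epoch, batch_size, base_lr=0.1, ref_batch=256, warmup_epochs=5):
    """Linear scaling: the learning rate grows with batch size; ramp up gradually, then decay stepwise."""
    target_lr = base_lr * batch_size / ref_batch             # linear scaling rule
    if epoch < warmup_epochs:                                 # gradual warmup from base_lr
        return base_lr + (target_lr - base_lr) * (epoch + 1) / warmup_epochs
    decay = 10 ** sum(epoch >= e for e in (30, 60, 80))       # step decay at assumed epochs
    return target_lr / decay
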
Don't Decay the Learning Rate, Increase the Batch Size
TLDR: This procedure is successful for stochastic gradient descent, SGD with momentum, Nesterov momentum, and Adam, and reaches equivalent test accuracies after the same number of training epochs but with fewer parameter updates, leading to greater parallelism and shorter training times.
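
A hedged sketch of the schedule the title describes: at each point where one would normally divide the learning rate by some factor, multiply the batch size by that factor instead, up to a memory cap. The milestones, factor, and cap below are assumptions chosen for illustration.

def batch_size_schedule(epoch, base_batch=128, factor=10, milestones=(30, 60), cap=8192):
    """Keep the learning rate fixed; grow the batch size where the learning rate would have decayed."""
    bs = base_batch * factor ** sum(epoch >= e for e in milestones)
    return min(bs, cap)
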
Train longer, generalize better: closing the generalization gap in large batch training of neural networks
TLDR: This work proposes a "random walk on a random landscape" statistical model which is known to exhibit similar "ultra-slow" diffusion behavior, and presents a novel algorithm named "Ghost Batch Normalization" which enables a significant decrease in the generalization gap without increasing the number of updates.
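
The idea behind ghost batch normalization is to compute normalization statistics over small virtual sub-batches of a large batch rather than over the full batch. The sketch below shows that computation only (the learnable scale and shift of a full batch-norm layer are omitted); the ghost size and tensor layout are assumptions.

import numpy as np

def ghost_batch_norm(x, ghost_size=32, eps=1e-5):
    """Normalize each virtual (ghost) sub-batch of a large batch with its own statistics.
    x: array of shape (batch, features); batch is assumed divisible by ghost_size."""
    out = np.empty_like(x, dtype=float)
    for start in range(0, x.shape[0], ghost_size):
        g = x[start:start + ghost_size]
        mean, var = g.mean(axis=0), g.var(axis=0)
        out[start:start + ghost_size] = (g - mean) / np.sqrt(var + eps)
    return out
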
Scaling SGD Batch Size to 32K for ImageNet Training
TLDR: Layer-wise Adaptive Rate Scaling (LARS) is proposed, a method to enable large-batch training for general networks or datasets; it can scale the batch size to 32768 for ResNet50 and 8192 for AlexNet.
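
The layer-wise scaling can be sketched as a per-layer trust ratio: the local step is proportional to ||w|| / (||grad|| + beta*||w||), so layers whose gradients are large relative to their weights take proportionally smaller steps. This is a reading of the commonly described LARS update rather than a verified reimplementation, and the coefficients below are illustrative.

import numpy as np

def lars_update(w, g, global_lr=1.0, trust_coef=1e-3, weight_decay=1e-4, eps=1e-9):
    """Layer-wise adaptive rate scaling for a single layer's weight tensor w with gradient g."""
    w_norm, g_norm = np.linalg.norm(w), np.linalg.norm(g)
    local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm + eps)
    return w - global_lr * local_lr * (g + weight_decay * w)
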
Optimization Methods for Large-Scale Machine Learning
TLDR: A major theme of this study is that large-scale machine learning represents a distinctive setting in which the stochastic gradient method has traditionally played a central role while conventional gradient-based nonlinear optimization techniques typically falter, leading to a discussion of the next generation of optimization methods for large-scale machine learning.
An Analysis of Deep Neural Network Models for Practical Applications
TLDR: This work presents a comprehensive analysis of metrics important in practical applications (accuracy, memory footprint, parameter count, operation count, inference time, and power consumption), and believes it provides a compelling set of information that helps design and engineer efficient DNNs.