Corpus ID: 232075817

Noisy Truncated SGD: Optimization and Generalization

@article{Zhou2021NoisyTS,
  title={Noisy Truncated SGD: Optimization and Generalization},
  author={Yingxue Zhou and Xinyan Li and Arindam Banerjee},
  journal={ArXiv},
  year={2021},
  volume={abs/2103.00075}
}
Recent empirical work on SGD applied to over-parameterized deep learning has shown that most gradient components over epochs are quite small. Inspired by such observations, we rigorously study properties of noisy truncated SGD (NT-SGD), a noisy gradient descent algorithm that truncates (hard thresholds) the majority of small gradient components to zeros and then adds Gaussian noise to all components. Considering non-convex smooth problems, we first establish the rate of convergence of NT-SGD in…
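For intuition, one step of the update described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration, assuming a keep-top-k truncation rule and an isotropic Gaussian noise scale sigma; the function and parameter names are assumptions, not the authors' implementation.

import numpy as np

def nt_sgd_step(w, grad, lr, k, sigma, rng):
    # Illustrative NT-SGD step: hard-threshold the gradient to its k
    # largest-magnitude coordinates, add Gaussian noise to every coordinate,
    # then take a descent step. All names/parameters here are assumptions.
    truncated = np.zeros_like(grad)
    keep = np.argsort(np.abs(grad))[-k:]          # indices of the k largest |grad_i|
    truncated[keep] = grad[keep]                  # remaining small components are set to zero
    noisy = truncated + sigma * rng.standard_normal(grad.shape)  # noise on all components
    return w - lr * noisy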
2 Citations


Descending through a Crowded Valley - Benchmarking Deep Learning Optimizers
This paper performs an extensive, standardized benchmark of more than a dozen particularly popular deep learning optimizers, identifying a significantly reduced subset of specific algorithms and parameter choices that generally provided competitive results in the authors' experiments.

References

Showing 1-10 of 63 references
The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects
This work studies a general form of gradient-based optimization dynamics with unbiased noise, which unifies SGD and standard Langevin dynamics, and shows that the anisotropic noise in SGD helps escape sharp and poor minima effectively, towards more stable and flat minima that typically generalize well.
Train faster, generalize better: Stability of stochastic gradient descent
We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically stable…
The Convergence of Sparsified Gradient Methods
It is proved that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees for both convex and non-convex smooth objectives for data-parallel SGD.
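As a reading aid, magnitude sparsification with local error correction can be sketched as a generic top-k-with-error-feedback step. This NumPy sketch uses assumed names and parameters and is not the paper's exact scheme.

import numpy as np

def topk_sgd_step(w, grad, residual, lr, k):
    # Illustrative sparsified-SGD step with local error correction:
    # unsent gradient mass accumulates in `residual` and is added back
    # before selecting the k largest-magnitude coordinates to apply.
    acc = residual + grad
    keep = np.argsort(np.abs(acc))[-k:]
    sparse = np.zeros_like(acc)
    sparse[keep] = acc[keep]
    new_residual = acc - sparse                   # error-correction memory
    return w - lr * sparse, new_residual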
On Generalization Error Bounds of Noisy Gradient Methods for Non-Convex Learning
A new framework, termed Bayes-Stability, is developed for proving algorithm-dependent generalization error bounds for learning general non-convex objectives, and it is demonstrated that the data-dependent bounds can distinguish randomly labelled data from normal data.
Adaptive Methods for Nonconvex Optimization
The result implies that increasing minibatch sizes enables convergence, thus providing a way to circumvent non-convergence issues; the paper also provides a new adaptive optimization algorithm, Yogi, which controls the increase in the effective learning rate, leading to even better performance with similar theoretical guarantees on convergence.
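For context, Yogi's key change relative to Adam is its second-moment update; a minimal sketch of that update as it is commonly presented follows (parameter names and the default value are assumptions, not taken from the paper).

import numpy as np

def yogi_second_moment(v, grad, beta2=0.999):
    # Illustrative Yogi second-moment update: instead of Adam's exponential
    # moving average, the estimate moves toward grad**2 additively, which
    # keeps the effective learning rate from growing too quickly.
    return v - (1 - beta2) * np.sign(v - grad**2) * grad**2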
signSGD: compressed optimisation for non-convex problems
SignSGD can get the best of both worlds: compressed gradients and an SGD-level convergence rate; the momentum counterpart of signSGD is able to match the accuracy and convergence speed of Adam on deep ImageNet models.
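The compression referred to above amounts to keeping only the sign of each gradient coordinate; an illustrative one-line update (function name is an assumption):

import numpy as np

def signsgd_step(w, grad, lr):
    # Illustrative signSGD step: only the 1-bit sign of each gradient
    # coordinate is used, which is what makes the gradient compressible.
    return w - lr * np.sign(grad)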
Generalization Bounds of SGLD for Non-convex Learning: Two Theoretical Viewpoints
This is the first algorithm-dependent result with reasonable dependence on aggregated step sizes for non-convex learning, and it has important implications for the statistical learning aspects of stochastic gradient methods in complicated models such as deep learning.
Information-Theoretic Generalization Bounds for SGLD via Data-Dependent Estimates
This work improves upon the stepwise analysis of noisy iterative learning algorithms, obtaining significantly improved mutual-information bounds for Stochastic Gradient Langevin Dynamics through data-dependent estimates and a variational characterization of mutual information.
Adam: A Method for Stochastic Optimization
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
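For reference, the "adaptive estimates of lower-order moments" are running estimates of the first and second moments of the gradient; a minimal sketch of the standard update in NumPy (hyperparameter values are the commonly quoted defaults, included here as assumptions):

import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Illustrative Adam update in its standard textbook form, not tied to
    # any particular implementation.
    m = beta1 * m + (1 - beta1) * grad            # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2         # second-moment estimate
    m_hat = m / (1 - beta1**t)                    # bias correction
    v_hat = v / (1 - beta2**t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v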
On Nonconvex Optimization for Machine Learning: Gradients, Stochasticity, and Saddle Points
Perturbed versions of GD and SGD are analyzed, and it is shown that they are truly efficient: their dimension dependence is only polylogarithmic.