Corpus ID: 222140873

Improved Analysis of Clipping Algorithms for Non-convex Optimization

@article{Zhang2020ImprovedAO,
  title={Improved Analysis of Clipping Algorithms for Non-convex Optimization},
  author={Bohang Zhang and Jikai Jin and Cong Fang and Liwei Wang},
  journal={ArXiv},
  year={2020},
  volume={abs/2010.02519}
}
Gradient clipping is commonly used in training deep neural networks, partly because of its practical effectiveness in relieving the exploding gradient problem. Recently, Zhang et al. (2019) show that clipped (stochastic) Gradient Descent (GD) converges faster than vanilla GD/SGD by introducing a new assumption called $(L_0, L_1)$-smoothness, which characterizes the violent fluctuation of gradients typically encountered in deep neural networks. However, their iteration complexities on the problem…
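
For context, the $(L_0, L_1)$-smoothness condition and the clipped gradient step it motivates can be sketched as follows; the notation is the one commonly used in this line of work and may differ in detail from the exact formulation analyzed in the paper:

$$\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\| \qquad \text{($(L_0,L_1)$-smoothness for twice-differentiable $f$)},$$
$$x_{k+1} = x_k - \min\!\left\{\eta,\ \frac{\gamma}{\|\nabla f(x_k)\|}\right\} \nabla f(x_k) \qquad \text{(clipped GD with step size $\eta$ and clipping threshold $\gamma$)}.$$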

Citations

Stability and Convergence of Stochastic Gradient Clipping: Beyond Lipschitz Continuity and Smoothness

TLDR
Qualitative and quantitative convergence results of the clipped stochastic (sub)gradient method (SGD) for non-smooth convex functions with rapidly growing subgradients are established and the proposed method achieves the best-known rate for the considered class of problems.

Recent Theoretical Advances in Non-Convex Optimization

TLDR
An overview of recent theoretical results on global performance guarantees of optimization algorithms for non-convex optimization and a list of problems that can be solved efficiently to find the global minimizer by exploiting the structure of the problem as much as possible.

Robustness to Unbounded Smoothness of Generalized SignSGD

TLDR
This paper theoretically proves that a generalized SignSGD algorithm can obtain similar convergence rates as SGD with clipping but does not need explicit clipping at all, and compares these algorithms with popular optimizers on a set of deep learning tasks, observing that they can match the performance of Adam while beating the others.
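
For orientation, the classical SignSGD-with-momentum (Signum) update that this line of work generalizes can be written as follows; this is a sketch of the baseline method, not the generalized algorithm proposed in the cited paper:

$$m_t = \beta m_{t-1} + (1-\beta)\, g_t, \qquad x_{t+1} = x_t - \eta_t\, \mathrm{sign}(m_t),$$

where $g_t$ is a stochastic gradient and the sign is applied coordinate-wise.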

A Communication-Efficient Distributed Gradient Clipping Algorithm for Training Deep Neural Networks

TLDR
A relaxed-smoothness assumption on the loss landscape, which LSTMs were shown to satisfy in previous works, is explored, and a communication-efficient gradient clipping algorithm is designed that exhibits fast convergence in practice and thus validates the theory.

Normalized/Clipped SGD with Perturbation for Differentially Private Non-Convex Optimization

TLDR
This paper studies two algorithms, DP-SGD and DP-NSGD, which clip or normalize per-sample gradients to bound the sensitivity and then add noise to obfuscate the exact information, and demonstrates that these two algorithms achieve similar best accuracy.
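
As a reference point, a generic DP-SGD step with per-sample clipping and Gaussian noise (in the style of Abadi et al., 2016) can be written as below; the DP-NSGD variant studied in the cited paper normalizes rather than clips, and its exact update may differ:

$$\tilde g_t = \frac{1}{|B_t|}\left( \sum_{i \in B_t} g_t^{(i)} \cdot \min\!\left\{1,\ \frac{C}{\|g_t^{(i)}\|}\right\} + \mathcal{N}\!\left(0, \sigma^2 C^2 I\right) \right), \qquad x_{t+1} = x_t - \eta\, \tilde g_t,$$

where $g_t^{(i)}$ is the gradient of sample $i$ in the mini-batch $B_t$, $C$ is the clipping norm, and $\sigma$ is the noise multiplier.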

An Adam Convergence Analysis (2022)

RBGNet: Ray-based Grouping for 3D Object Detection

TLDR
This paper proposes the RBGNet framework, a voting-based 3D detector for accurate 3D object detection from point clouds, and proposes a ray-based feature grouping module, which aggregates the point-wise features on object surfaces using a group of determined rays uniformly emitted from cluster centers.

Non-convex Distributionally Robust Optimization: Non-asymptotic Analysis

TLDR
This work proves that a special algorithm, mini-batch normalized gradient descent with momentum, can find an $\epsilon$-first-order stationary point within $O(\epsilon^{-4})$ gradient complexity, and proposes a penalized DRO objective based on a smoothed version of the CVaR that allows us to obtain a similar convergence guarantee.
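
For reference, one common form of normalized SGD with momentum is sketched below; the mini-batch variant analyzed in the cited paper, and the penalized DRO objective it is applied to, add further details:

$$m_t = \beta m_{t-1} + (1-\beta)\, g_t, \qquad x_{t+1} = x_t - \eta\, \frac{m_t}{\|m_t\|},$$

where $g_t$ is a mini-batch stochastic gradient of the (smoothed) DRO objective.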

Stochastic Training is Not Necessary for Generalization

TLDR
It is demonstrated that non-stochastic full-batch training can achieve strong performance on CIFAR-10 that is on-par with SGD, using modern architectures in settings with and without data augmentation.

High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails

TLDR
It is shown that after a suitable “burn-in” period, the objective value will monotonically decrease whenever the current iterate is not a critical point, which provides intuition behind the popular practice of learning rate “warm-up” and also yields a last-iterate guarantee.

References

Showing 1-10 of 35 references

Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity

TLDR
It is shown that gradient smoothness, a concept central to the analysis of first-order optimization algorithms and often assumed to be a constant, demonstrates significant variability along the training trajectory of deep neural networks and positively correlates with the gradient norm, contrary to standard assumptions in the literature.

Can gradient clipping mitigate label noise?

TLDR
It is proved that for the common problem of label noise in classification, standard gradient clipping does not in general provide robustness, and it is shown that a simple variant of gradient clipping is provably robust, and corresponds to suitably modifying the underlying loss function.

Understanding Gradient Clipping in Private SGD: A Geometric Perspective

TLDR
This work demonstrates how gradient clipping can prevent SGD from converging to a stationary point and provides a theoretical analysis that fully quantifies the clipping bias on convergence with a disparity measure between the gradient distribution and a geometrically symmetric distribution.

Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping

TLDR
The first non-trivial high-probability complexity bounds for SGD with clipping are derived without a light-tails assumption on the noise, closing the gap in the theory of stochastic optimization with heavy-tailed noise.

Why are Adaptive Methods Good for Attention Models?

TLDR
Empirical and theoretical evidence is provided that a heavy-tailed distribution of the noise in stochastic gradients is one cause of SGD's poor performance, and the first tight upper and lower convergence bounds for adaptive gradient methods under heavy-tailed noise are established.

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

TLDR
This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and results in regret guarantees that are provably as good as those of the best proximal function that can be chosen in hindsight.
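
Concretely, the widely used diagonal form of the AdaGrad update scales each coordinate by the accumulated squared gradients (the paper also develops full-matrix and composite/proximal versions):

$$x_{t+1,i} = x_{t,i} - \frac{\eta}{\sqrt{\sum_{s=1}^{t} g_{s,i}^2} + \epsilon}\, g_{t,i},$$

where $g_{t,i}$ is the $i$-th coordinate of the (sub)gradient at step $t$ and $\epsilon$ is a small constant for numerical stability.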

Adam: A Method for Stochastic Optimization

TLDR
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
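
For reference, the Adam update maintains exponential moving averages of the gradient and its element-wise square, with bias correction:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2,$$
$$\hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat v_t = \frac{v_t}{1-\beta_2^t}, \qquad x_{t+1} = x_t - \eta\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon},$$

with suggested defaults $\beta_1 = 0.9$, $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$.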

Scaling up Differentially Private Deep Learning with Fast Per-Example Gradient Clipping

TLDR
New methods for per-example gradient clipping that are compatible with auto-differentiation and provide better GPU utilization are derived by analyzing the back-propagation equations of Rényi Differential Privacy.

Why ADAM Beats SGD for Attention Models

TLDR
Empirical and theoretical evidence is provided that a heavy-tailed distribution of the noise in stochastic gradients is a root cause of SGD's poor performance and a new adaptive coordinate-wise clipping algorithm (ACClip) tailored to such settings is developed.
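
As a rough sketch, coordinate-wise clipping replaces the single global-norm threshold with a per-coordinate one; in ACClip the thresholds are chosen adaptively and applied to a momentum estimate, so the exact rule differs from this simplified form:

$$\tilde g_{t,i} = \mathrm{sign}(g_{t,i}) \cdot \min\{ |g_{t,i}|,\ \tau_i \}, \qquad x_{t+1} = x_t - \eta\, \tilde g_t.$$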

An Alternative View: When Does SGD Escape Local Minima?

TLDR
SGD will not get stuck at "sharp" local minima with small diameters, as long as the neighborhoods of these regions contain enough gradient information, which helps explain why SGD works so well for neural networks.