Corpus ID: 235659009

High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails

@inproceedings{Cutkosky2021HighprobabilityBF,
  title={High-probability Bounds for Non-Convex Stochastic Optimization with Heavy Tails},
  author={Ashok Cutkosky and Harsh Mehta},
  booktitle={NeurIPS},
  year={2021}
}
We consider non-convex stochastic optimization using first-order algorithms for which the gradient estimates may have heavy tails. We show that a combination of gradient clipping, momentum, and normalized gradient descent yields convergence to critical points in high probability, with the best-known rates for smooth losses when the gradients have only bounded p-th moments for some p ∈ (1, 2]. We then consider the case of second-order smooth losses, which to our knowledge have not been studied in…
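The combination the abstract describes — clip each stochastic gradient, average the clipped gradients with momentum, then take a normalized step — can be sketched as follows. This is an illustrative reading, not the paper's exact algorithm: the function names, the clipping threshold `tau`, and the toy Cauchy-noise objective are our assumptions.

```python
import numpy as np

def clip(g, tau):
    """Scale the gradient estimate so its norm is at most tau."""
    norm = np.linalg.norm(g)
    return g if norm <= tau else (tau / norm) * g

def clipped_normalized_sgd_momentum(grad_fn, x0, lr=0.01, beta=0.9, tau=1.0, steps=100):
    """Momentum over clipped stochastic gradients, followed by a normalized step."""
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(steps):
        g = clip(grad_fn(x), tau)          # heavy-tailed gradient estimate, clipped
        m = beta * m + (1.0 - beta) * g    # momentum average of clipped gradients
        norm = np.linalg.norm(m)
        if norm > 0:
            x = x - lr * m / norm          # normalized gradient descent step
    return x

# Toy example: f(x) = ||x||^2 / 2 with heavy-tailed (Cauchy) gradient noise,
# whose variance is infinite, so plain SGD guarantees do not apply.
rng = np.random.default_rng(0)
noisy_grad = lambda x: x + rng.standard_cauchy(x.shape)
x_final = clipped_normalized_sgd_momentum(noisy_grad, np.full(5, 10.0),
                                          lr=0.05, tau=2.0, steps=2000)
```

Clipping bounds the influence of any single heavy-tailed sample, while normalization keeps the step length fixed regardless of the gradient scale.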

Citations

Clipped Stochastic Methods for Variational Inequalities with Heavy-Tailed Noise
TLDR
This work proves the first high-probability complexity results, with logarithmic dependence on the confidence level, for stochastic methods solving monotone and structured non-monotone VIPs with non-sub-Gaussian (heavy-tailed) noise and unbounded domains.
Mirror Descent Strikes Again: Optimal Stochastic Convex Optimization under Infinite Noise Variance
TLDR
This work quantifies the convergence rate of the Stochastic Mirror Descent algorithm with a particular class of uniformly convex mirror maps, in terms of the number of iterations, dimensionality and related geometric parameters of the optimization problem.
High Probability Bounds for a Class of Nonconvex Algorithms with AdaGrad Stepsize
TLDR
This paper focuses on a particular accelerated gradient descent (AGD) template (Lan, 2020), through which it recovers the original AdaGrad and its variant with averaging, and proves a convergence rate of O(1/√T) with high probability without knowledge of the smoothness and variance.
Normalized/Clipped SGD with Perturbation for Differentially Private Non-Convex Optimization
TLDR
This paper studies two algorithms, DP-SGD and DP-NSGD, which clip or normalize per-sample gradients to bound the sensitivity and then add noise to obfuscate the exact information, and demonstrates that these two algorithms achieve similar best accuracy.
Provable Acceleration of Heavy Ball beyond Quadratics for a Class of Polyak-Łojasiewicz Functions when the Non-Convexity is Averaged-Out
TLDR
This work develops some new techniques that help show acceleration beyond quadratics, which is achieved by analyzing how the change of the Hessian at two consecutive time points affects the convergence speed.
Risk regularization through bidirectional dispersion
TLDR
This work studies a complementary new risk class that penalizes loss deviations in a bidirectional manner, while having more flexibility in terms of tail sensitivity than is offered by classical mean-variance, without sacrificing computational or analytical tractability.

References

SHOWING 1-10 OF 44 REFERENCES
Convergence Rates of Stochastic Gradient Descent under Infinite Noise Variance
TLDR
The results indicate that even under heavy-tailed noise with infinite variance, SGD can converge to the global optimum without requiring any modification to either the loss function or the algorithm itself, as is typically required in robust statistics.
Robustness Analysis of Non-Convex Stochastic Gradient Descent using Biased Expectations
TLDR
It is shown that heavy-tailed noise on the gradient slows down the convergence of SGD without preventing it, proving that SGD is robust to gradient noise with unbounded variance, a setting of interest for Deep Learning.
Lower Bounds for Non-Convex Stochastic Optimization
TLDR
It is proved that (in the worst case) any algorithm requires at least $\epsilon^{-4}$ queries to find an $\epsilon$-stationary point, and establishes that stochastic gradient descent is minimax optimal in this model.
On the Convergence of Stochastic Gradient Descent with Adaptive Stepsizes
TLDR
This paper theoretically analyzes a generalized version of the AdaGrad stepsizes in the convex and non-convex settings, and shows sufficient conditions for these stepsizes to achieve almost sure asymptotic convergence of the gradients to zero, proving the first guarantee for generalized AdaGrad stepsizes in the non-convex setting.
Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping
TLDR
The first non-trivial high-probability complexity bounds for SGD with clipping are derived without a light-tails assumption on the noise, closing the gap in the theory of stochastic optimization with heavy-tailed noise.
Time-uniform, nonparametric, nonasymptotic confidence sequences
A confidence sequence is a sequence of confidence intervals that is uniformly valid over an unbounded time horizon. Our work develops confidence sequences whose widths go to zero, with nonasymptotic…
On the Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks
TLDR
It is argued that the Gaussianity assumption might fail to hold in deep learning settings and hence render the Brownian motion-based analyses inappropriate and establish an explicit connection between the convergence rate of SGD to a local minimum and the tail-index $\alpha$.
Improved Analysis of Clipping Algorithms for Non-convex Optimization
TLDR
This paper presents a general framework to study the clipping algorithms, which also takes momentum methods into consideration, and provides convergence analysis of the framework in both deterministic and stochastic setting, and demonstrates the tightness of the results by comparing them with existing lower bounds.
Second-Order Information in Non-Convex Stochastic Optimization: Power and Limitations
TLDR
An algorithm which finds an $\epsilon$-approximate stationary point using stochastic gradients and Hessian-vector products is designed, and a lower bound is proved which establishes that this rate is optimal and cannot be improved using stochastic $p$-th order methods for any $p \ge 2$, even when the first $p$ derivatives of the objective are Lipschitz.
Adam: A Method for Stochastic Optimization
TLDR
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
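The "adaptive estimates of lower-order moments" in this summary refer to Adam's exponential moving averages of the gradient and its square. A minimal scalar sketch of one update, with bias correction for the zero initialization (variable names are ours, not from the paper):

```python
import math

def adam_step(theta, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update on a scalar parameter theta given gradient g at step t."""
    m = beta1 * m + (1 - beta1) * g        # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g * g    # second-moment (uncentered) estimate
    m_hat = m / (1 - beta1 ** t)           # bias-correct the zero initialization
    v_hat = v / (1 - beta2 ** t)
    theta -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta^2; the exact gradient is 2 * theta
theta, m, v = 5.0, 0.0, 0.0
for t in range(1, 3001):
    theta, m, v = adam_step(theta, 2.0 * theta, m, v, t, lr=0.01)
```

Dividing by the second-moment estimate makes the effective step size roughly scale-free per coordinate, which is what distinguishes Adam from plain SGD with momentum.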
…