Corpus ID: 203737226

The Complexity of Finding Stationary Points with Stochastic Gradient Descent

@inproceedings{Drori2020TheCO,
  title={The Complexity of Finding Stationary Points with Stochastic Gradient Descent},
  author={Yoel Drori and Ohad Shamir},
  booktitle={ICML},
  year={2020}
}
We study the iteration complexity of stochastic gradient descent (SGD) for minimizing the gradient norm of smooth, possibly nonconvex functions. We provide several results, implying that the classical $\mathcal{O}(\epsilon^{-4})$ upper bound (for making the average gradient norm less than $\epsilon$) cannot be improved upon, unless a combination of additional assumptions is made. Notably, this holds even if we limit ourselves to convex quadratic functions. We also show that for nonconvex… 
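As a concrete illustration of the setup in the abstract (a sketch for orientation, not code from the paper): running SGD on a smooth objective with unbiased gradient noise and recording the average gradient norm, the quantity the $\mathcal{O}(\epsilon^{-4})$ bound controls. The quadratic objective, step size, and Gaussian noise model below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def grad(x):
    # Gradient of the convex quadratic f(x) = 0.5 * ||x||^2
    return x

def noisy_grad(x, sigma=1.0):
    # Unbiased stochastic gradient: true gradient plus zero-mean noise
    return grad(x) + sigma * rng.standard_normal(x.shape)

def sgd(x0, eta=0.01, steps=10_000):
    """Plain SGD with constant step size; returns the final iterate
    and the average true-gradient norm along the trajectory."""
    x = x0.copy()
    norms = []
    for _ in range(steps):
        norms.append(np.linalg.norm(grad(x)))
        x -= eta * noisy_grad(x)
    return x, float(np.mean(norms))

x_final, avg_norm = sgd(np.ones(10))
```

Even on this convex quadratic, the average gradient norm plateaus at a noise floor set by the step size and noise variance, which is the regime where the paper's lower bounds bite.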

Citations

Computing the Variance of Shuffling Stochastic Gradient Algorithms via Power Spectral Density Analysis
TLDR
The analysis extends beyond SGD to SGD with momentum and to the stochastic Nesterov’s accelerated gradient method, and performs experiments on quadratic objective functions to test the validity of the approximation and the correctness of the findings.
Learning Halfspaces with Massart Noise Under Structured Distributions
TLDR
This work identifies a smooth non-convex surrogate loss with the property that any approximate stationary point of this loss defines a halfspace that is close to the target halfspace, and can be used to solve the underlying learning problem.
Adaptive Gradient Descent for Convex and Non-Convex Stochastic Optimization
TLDR
These algorithms are based on Armijo-type line search and they simultaneously adapt to the unknown Lipschitz constant of the gradient and variance of the stochastic approximation for the gradient.
STOCHASTIC GRADIENT DESCENT
TLDR
This paper develops a broad condition on the sequence of examples used by SGD that is sufficient to prove tight convergence rates in both strongly convex and non-convex settings, and proposes two new example-selection approaches using quasi-Monte-Carlo methods.
Branch-and-Bound Performance Estimation Programming: A Unified Methodology for Constructing Optimal Optimization Methods
TLDR
The BnB-PEP methodology is applied to several setups for which the prior methodologies do not apply and obtain methods with bounds that improve upon prior state-of-the-art results, thereby systematically generating analytical convergence proofs.
Tight Convergence Rates of the Gradient Method on Hypoconvex Functions
We perform the first tight convergence analysis of the gradient method with fixed step sizes applied to the class of smooth hypoconvex (weakly-convex) functions, i.e., smooth nonconvex functions…
Tight convergence rates of the gradient method on smooth hypoconvex functions
We perform the first tight convergence analysis of the gradient method with varying step sizes when applied to smooth hypoconvex (weakly convex) functions. Hypoconvex functions are smooth nonconvex…
Latency considerations for stochastic optimizers in variational quantum algorithms
TLDR
Stochastic optimization algorithms that yield stochastic processes emulating the dynamics of classical deterministic algorithms result in methods with theoretically superior worst-case iteration complexities, at the expense of greater per-iteration sample (shot) complexities.
Exact Optimal Accelerated Complexity for Fixed-Point Iterations
Despite the broad use of fixed-point iterations throughout applied mathematics, the optimal convergence rate of general fixed-point problems with nonexpansive nonlinear operators has not been…
…
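The Armijo-type line search mentioned under "Adaptive Gradient Descent for Convex and Non-Convex Stochastic Optimization" above can be sketched as a generic backtracking rule. This is a standard textbook scheme with assumed constants, not the authors' exact adaptive algorithm:

```python
import numpy as np

def armijo_step(f, grad_f, x, eta0=1.0, c=0.5, beta=0.5, max_backtracks=30):
    """Backtracking line search along the negative gradient direction.

    Halves the trial step size until the Armijo sufficient-decrease
    condition  f(x - eta*g) <= f(x) - c*eta*||g||^2  holds, so no
    Lipschitz constant of the gradient needs to be known in advance.
    """
    g = grad_f(x)
    fx = f(x)
    eta = eta0
    for _ in range(max_backtracks):
        if f(x - eta * g) <= fx - c * eta * np.dot(g, g):
            break
        eta *= beta
    return x - eta * g, eta

# Toy run on an ill-conditioned quadratic f(x) = 0.5 * x^T A x
A = np.diag([1.0, 100.0])
f = lambda x: 0.5 * x @ A @ x
grad_f = lambda x: A @ x
x = np.array([1.0, 1.0])
for _ in range(50):
    x, eta = armijo_step(f, grad_f, x)
```

Because the accepted step always satisfies the sufficient-decrease condition, the objective value is monotonically non-increasing even though the curvature is never supplied explicitly.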

References

Showing 1–10 of 27 references
Sharp Analysis for Nonconvex SGD Escaping from Saddle Points
TLDR
A sharp analysis for Stochastic Gradient Descent is given, and it is proved that SGD is able to efficiently escape from saddle points and find an approximate second-order stationary point in $\tilde{O}(\epsilon^{-3.5})$ stochastic gradient computations for generic nonconvex optimization problems, when the objective function satisfies gradient-Lipschitz, Hessian-Lipschitz, and dispersive noise assumptions.
How To Make the Gradients Small Stochastically: Even Faster Convex and Nonconvex SGD
TLDR
If $f(x)$ is convex, the original SGD does not give an optimal rate for finding its $\varepsilon$-approximate local minimum, so this work designs an algorithm, SGD3, with a near-optimal rate, improving on the best known rate $O(\varepsilon^{-8/3})$.
Lower Bounds for Non-Convex Stochastic Optimization
TLDR
It is proved that (in the worst case) any algorithm requires at least $\epsilon^{-4}$ queries to find an $\epsilon$-stationary point, which establishes that stochastic gradient descent is minimax optimal in this model.
Stochastic Approximation and Recursive Algorithms and Applications
Contents (excerpt): Introduction; 1. Review of Continuous Time Models: 1.1 Martingales and Martingale Inequalities; 1.2 Stochastic Integration; 1.3 Stochastic Differential Equations: Diffusions; 1.4 Reflected Diffusions; 1.5 …
The Complexity of Making the Gradient Small in Stochastic Convex Optimization
TLDR
It is shown that in the global oracle/statistical learning model, only logarithmic dependence on smoothness is required to find a near-stationary point, whereas polynomial dependence on smoothness is necessary in the local stochastic oracle model.
Problem Complexity and Method Efficiency in Optimization
Convergence and efficiency of subgradient methods for quasiconvex minimization
TLDR
The general subgradient projection method for minimizing a quasiconvex objective subject to a convex set constraint in a Hilbert space is studied, finding $\epsilon$-solutions with an efficiency estimate of $O(\epsilon^{-2})$, thus being optimal in the sense of Nemirovskii.
Introductory Lectures on Convex Optimization - A Basic Course
TLDR
It was in the middle of the 1980s when the seminal paper by Karmarkar opened a new epoch in nonlinear optimization; it became more and more common for new methods to be provided with a complexity analysis, which was considered a better justification of their efficiency than computational experiments.
On the Gap Between Strict-Saddles and True Convexity: An Omega(log d) Lower Bound for Eigenvector Approximation
TLDR
A lower bound on the query complexity of rank-one principal component analysis (PCA) is proved by developing a "truncated" analogue of the $\chi^2$ Bayes-risk lower bound of Chen et al.
…
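For intuition on the subgradient projection method discussed under "Convergence and efficiency of subgradient methods for quasiconvex minimization" above, here is a minimal projected subgradient sketch with $1/\sqrt{t}$ step sizes on a simple convex instance. The objective, constraint set, and step-size rule are illustrative assumptions, not the paper's exact scheme:

```python
import numpy as np

def project_ball(x, radius=1.0):
    # Euclidean projection onto the ball {x : ||x||_2 <= radius}
    n = np.linalg.norm(x)
    return x if n <= radius else x * (radius / n)

def projected_subgradient(f, subgrad, x0, steps=500):
    """Projected subgradient method with 1/sqrt(t) step sizes,
    returning the best iterate seen (subgradient steps need not
    decrease f monotonically)."""
    x = x0.copy()
    best = x.copy()
    for t in range(1, steps + 1):
        x = project_ball(x - subgrad(x) / np.sqrt(t))
        if f(x) < f(best):
            best = x.copy()
    return best

# Toy instance: minimize ||x - c||_1 over the unit ball;
# a subgradient of the objective is sign(x - c)
c = np.array([2.0, 0.0])
f = lambda x: np.abs(x - c).sum()
subgrad = lambda x: np.sign(x - c)
x_best = projected_subgradient(f, subgrad, np.zeros(2))
```

Tracking the best iterate rather than the last one is the standard device behind $O(\epsilon^{-2})$-type efficiency estimates for subgradient schemes.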