Corpus ID: 238408265

Linear Convergence of Generalized Mirror Descent with Time-Dependent Mirrors

@inproceedings{Radhakrishnan2020LinearCO,
  title={Linear Convergence of Generalized Mirror Descent with Time-Dependent Mirrors},
  author={Adityanarayanan Radhakrishnan and Mikhail Belkin and Caroline Uhler},
  year={2020}
}
The Polyak-Łojasiewicz (PL) inequality is a sufficient condition for establishing linear convergence of gradient descent, even in non-convex settings. While several recent works use a PL-based analysis to establish linear convergence of stochastic gradient descent methods, the question remains as to whether a similar analysis can be conducted for more general optimization methods. In this work, we present a PL-based analysis for linear convergence of generalized mirror descent (GMD), a generalization of mirror descent with a possibly time-dependent mirror map.
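For reference, in standard form (not quoted from the paper): a differentiable objective f with minimum value f* satisfies the PL inequality with parameter μ > 0 when half the squared gradient norm dominates the suboptimality gap, and combining this with L-smoothness yields the linear (geometric) rate for gradient descent that PL-based analyses build on:

\[
  \tfrac{1}{2}\,\lVert \nabla f(x) \rVert^{2} \;\ge\; \mu \bigl( f(x) - f^{*} \bigr)
  \qquad \text{(PL inequality)},
\]
\[
  x_{k+1} = x_{k} - \tfrac{1}{L}\,\nabla f(x_{k})
  \quad \Longrightarrow \quad
  f(x_{k+1}) - f^{*} \;\le\; \Bigl( 1 - \tfrac{\mu}{L} \Bigr) \bigl( f(x_{k}) - f^{*} \bigr),
\]

so the suboptimality gap contracts by a constant factor at every step. The paper's contribution is to carry this style of argument over to generalized mirror descent, where the Euclidean geometry above is replaced by a possibly time-varying mirror map.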

References

Showing 1-10 of 22 references
Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak-Łojasiewicz Condition
This work shows that the much older Polyak-Łojasiewicz (PL) inequality is actually weaker than the main conditions that have been explored to show linear convergence rates without strong convexity over the last 25 years, leading to simple proofs of linear convergence of these methods.
Stochastic Gradient/Mirror Descent: Minimax Optimality and Implicit Regularization
It is argued that this identity can be used in the so-called "highly over-parameterized" nonlinear setting to provide insights into why SMD (and SGD) may have similar convergence and implicit regularization properties for deep learning.
Mirror descent and nonlinear projected subgradient methods for convex optimization
Linear Convergence of Adaptive Stochastic Gradient Descent
We prove that the norm version of the adaptive stochastic gradient method (AdaGrad-Norm) achieves a linear convergence rate for a subset of either strongly convex functions or non-convex functions that satisfy the Polyak-Łojasiewicz (PL) inequality.
Stochastic Mirror Descent on Overparameterized Nonlinear Models.
This article shows that for sufficiently overparameterized nonlinear models, SMD with a small enough fixed step size converges to a global minimum that is "very close" (in Bregman divergence) to the minimum-potential interpolating solution, thus attaining approximate implicit regularization (a schematic mirror-descent update is sketched after this reference list).
Fast and Faster Convergence of SGD for Over-Parameterized Models and an Accelerated Perceptron
It is proved that constant step-size stochastic gradient descent (SGD) with Nesterov acceleration matches the convergence rate of the deterministic accelerated method for both convex and strongly convex functions.
Learning Positive Functions with Pseudo Mirror Descent
A novel algorithm is proposed that performs efficient estimation of positive functions within a Hilbert space without expensive projections and outperforms the state-of-the-art benchmarks for learning intensities of Poisson and multivariate Hawkes processes, in terms of both computational efficiency and accuracy.
AdaGrad stepsizes: sharp convergence over nonconvex landscapes
The norm version of AdaGrad (AdaGrad-Norm) converges to a stationary point at the O(log(N)/√N) rate in the stochastic setting, and at the optimal O(1/N) rate in the batch (non-stochastic) setting; in this sense, the convergence guarantees are “sharp”.
Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning
This work shows that optimization problems corresponding to over-parameterized systems of non-linear equations are not convex, even locally, but instead satisfy the Polyak-Łojasiewicz (PL) condition, allowing for efficient optimization by gradient descent or SGD.
Overparameterized Nonlinear Learning: Gradient Descent Takes the Shortest Path?
This paper demonstrates the utility of the general theory of (stochastic) gradient descent for a variety of problem domains spanning low-rank matrix recovery to neural network training and develops novel martingale techniques that guarantee SGD never leaves a small neighborhood of the initialization, even with rather large learning rates.
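To make the mirror-descent updates referenced above concrete, the following is a minimal illustrative sketch in Python; it is not code from the paper or from any of the references, and all names (gmd, squared_norm_mirror, mirrors, and so on) are hypothetical. It implements the classical mirror-descent step ∇ψ(x_{k+1}) = ∇ψ(x_k) − η ∇f(x_k) while letting the potential ψ_k change with the iteration index, which is the "time-dependent mirror" setting described in the abstract; the paper's exact formulation of GMD may differ.

import numpy as np

# Illustrative sketch of mirror descent with a possibly time-dependent mirror map.
# Hypothetical names; not the formulation used in the paper.

def squared_norm_mirror(scale):
    """Mirror map of the scaled potential psi(x) = scale * ||x||^2 / 2.

    Returns (grad_psi, grad_psi_inverse); with scale = 1.0 the update below
    reduces to plain gradient descent.
    """
    grad_psi = lambda x: scale * x
    grad_psi_inv = lambda z: z / scale
    return grad_psi, grad_psi_inv


def gmd(grad_f, x0, step_size, mirrors, num_iters):
    """Mirror descent where the mirror map may change at every iteration.

    grad_f    : callable returning the gradient of the objective f
    x0        : initial iterate (array-like)
    step_size : fixed step size eta
    mirrors   : callable k -> (grad_psi_k, grad_psi_k_inverse)
    num_iters : number of iterations to run
    """
    x = np.asarray(x0, dtype=float)
    for k in range(num_iters):
        grad_psi, grad_psi_inv = mirrors(k)
        # Mirror-descent step: grad_psi_k(x_{k+1}) = grad_psi_k(x_k) - eta * grad_f(x_k).
        x = grad_psi_inv(grad_psi(x) - step_size * grad_f(x))
    return x


if __name__ == "__main__":
    # Toy quadratic f(x) = ||x||^2 / 2, which satisfies the PL inequality with mu = 1.
    grad_f = lambda x: x
    # A time-dependent mirror whose curvature grows slowly with the iteration index.
    mirrors = lambda k: squared_norm_mirror(scale=1.0 + 0.1 * k)
    x_final = gmd(grad_f, x0=np.ones(5), step_size=0.5, mirrors=mirrors, num_iters=50)
    print(x_final)  # entries should be close to the minimizer at 0

With the plain squared-Euclidean potential the step is exactly gradient descent; other choices of potential, such as the negative entropy, recover the familiar mirror-descent variants analyzed in the references above.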