On the Last Iterate Convergence of Momentum Methods
@inproceedings{Li2021OnTL,
  title     = {On the Last Iterate Convergence of Momentum Methods},
  author    = {Xiaoyun Li and Mingrui Liu and Francesco Orabona},
  booktitle = {International Conference on Algorithmic Learning Theory},
  year      = {2021}
}
SGD with Momentum (SGDM) is a widely used family of algorithms for large-scale optimization of machine learning problems. Yet, when optimizing generic convex functions, no advantage is known for any SGDM algorithm over plain SGD. Moreover, even the most recent results require changes to the SGDM algorithms, like averaging of the iterates and a projection onto a bounded domain, which are rarely used in practice. In this paper, we focus on the convergence rate of the last iterate of SGDM. For the…
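For reference, a minimal sketch of the heavy-ball SGDM update the abstract refers to, returning the last iterate; the step size, momentum value, noise model, and toy quadratic objective below are illustrative assumptions, not the paper's setting.

```python
import numpy as np

def sgdm(grad, x0, lr=0.1, beta=0.9, n_steps=100, seed=0):
    """Heavy-ball SGD with momentum: m <- beta*m + g,  x <- x - lr*m."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(n_steps):
        g = grad(x) + 0.01 * rng.standard_normal(x.shape)  # noisy (stochastic) gradient
        m = beta * m + g
        x = x - lr * m
    return x  # the last iterate, whose convergence rate is the object of study

# Toy usage on the convex quadratic f(x) = 0.5 * ||x||^2
x_last = sgdm(grad=lambda x: x, x0=[1.0, -2.0])
```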
2 Citations
Momentum Provably Improves Error Feedback!
- Computer Science
- 2023
A surprisingly simple fix is proposed which removes this issue both theoretically and in practice: the application of Polyak's momentum to the latest incarnation of EF due to Richtárik et al.
High Probability Guarantees for Nonconvex Stochastic Gradient Descent with Heavy Tails
- Computer Science, ICML
- 2022
This paper develops high-probability bounds for nonconvex SGD from a joint perspective on optimization and generalization performance, and shows that gradient clipping can be employed to remove bounded-gradient-type assumptions.
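As a rough illustration of the clipped-gradient step mentioned in this summary (a generic sketch, not the paper's exact algorithm); the threshold tau and the step size are assumed values.

```python
import numpy as np

def clipped_sgd_step(x, g, lr=0.1, tau=1.0):
    """One SGD step where the gradient is rescaled to have norm at most tau."""
    norm = np.linalg.norm(g)
    if norm > tau:
        g = g * (tau / norm)  # clipping guards against heavy-tailed gradient noise
    return x - lr * g

# Example: a heavy-tailed gradient sample gets clipped before the update
x_next = clipped_sgd_step(np.array([3.0, -4.0]), np.array([30.0, -40.0]))
```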
48 References
Adaptive Subgradient Methods for Online Learning and Stochastic Optimization
- Computer Science, J. Mach. Learn. Res.
- 2011
This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and yields regret guarantees provably as good as those of the best proximal function that could have been chosen in hindsight.
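A minimal sketch of the diagonal (per-coordinate) AdaGrad update this entry describes; the learning rate, epsilon term, and toy objective are conventional choices rather than values from the paper.

```python
import numpy as np

def adagrad(grad, x0, lr=0.5, eps=1e-8, n_steps=100):
    """Diagonal AdaGrad: divide each step by the root of the accumulated squared gradients."""
    x = np.array(x0, dtype=float)
    s = np.zeros_like(x)                   # per-coordinate sum of squared gradients
    for _ in range(n_steps):
        g = grad(x)
        s += g ** 2
        x -= lr * g / (np.sqrt(s) + eps)   # adaptive per-coordinate step size
    return x

x_out = adagrad(grad=lambda x: x, x0=[1.0, -2.0])
```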
Understanding the Role of Momentum in Non-Convex Optimization: Practical Insights from a Lyapunov Analysis
- Computer Science, ArXiv
- 2020
A Lyapunov analysis of SGD with momentum (SGD+M) that uses an equivalent rewriting of the method, known as the stochastic primal averaging (SPA) form, and is much tighter than previous theory in the non-convex case.
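A rough sketch of the rewriting this entry refers to, under the simplifying assumption of constant parameters: with c = 1 - beta and eta = lr / (1 - beta), a primal-averaging recursion on an auxiliary sequence z reproduces the heavy-ball iterates. The parameter names and the deterministic toy objective are illustrative, not taken from the paper.

```python
import numpy as np

def sgdm_vs_spa(grad, x0, lr=0.1, beta=0.9, n_steps=50):
    """Run heavy-ball SGDM and its primal-averaging (SPA) rewriting side by side."""
    x_m = np.array(x0, dtype=float)         # SGDM iterate
    m = np.zeros_like(x_m)                  # SGDM momentum buffer
    x_a = np.array(x0, dtype=float)         # SPA iterate
    z = np.array(x0, dtype=float)           # SPA auxiliary sequence
    c, eta = 1.0 - beta, lr / (1.0 - beta)  # parameter mapping (constant-parameter case)
    for _ in range(n_steps):
        m = beta * m + grad(x_m)
        x_m = x_m - lr * m
        z = z - eta * grad(x_a)
        x_a = (1.0 - c) * x_a + c * z       # weighted averaging step
    return x_m, x_a

x_m, x_a = sgdm_vs_spa(grad=lambda x: x, x0=[1.0, -2.0])
assert np.allclose(x_m, x_a)                # the two forms trace the same trajectory
```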
Primal Averaging: A New Gradient Evaluation Step to Attain the Optimal Individual Convergence
- Computer Science, IEEE Transactions on Cybernetics
- 2020
It is proved that simply modifying the gradient evaluation step in MD by the PA strategy suffices to recover the optimal individual rate for general convex problems; the method is then extended to regularized nonsmooth learning problems in the stochastic setting, which reveals that the PA strategy is a simple yet effective extra step toward the optimal individual convergence of SGD.
Quasi-monotone Subgradient Methods for Nonsmooth Convex Minimization
- Mathematics, Computer Science, J. Optim. Theory Appl.
- 2015
These methods guarantee the best possible rate of convergence for the whole sequence of test points and are applicable as efficient real-time stabilization tools for potential systems with an infinite horizon.
An optimal method for stochastic composite optimization
- Computer Science, Mathematics, Math. Program.
- 2012
The accelerated stochastic approximation (AC-SA) algorithm, based on Nesterov's optimal method for smooth convex programming, is introduced, and it is shown that AC-SA achieves the corresponding lower bound on the rate of convergence for stochastic composite optimization (SCO).
Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization
- Computer Science, COLT
- 2008
This work introduces an efficient algorithm for online linear optimization in the bandit setting that achieves the optimal O*(√T) regret, and presents a novel connection between online learning and interior point methods.
Online learning: theory, algorithms and applications
- Computer Science
- 2007
This dissertation describes a novel framework for the design and analysis of online learning algorithms and proposes a new perspective on regret bounds which is based on the notion of duality in convex optimization.
Adaptive Bound Optimization for Online Convex Optimization
- Computer Science, COLT
- 2010
This work introduces a new online convex optimization algorithm that adaptively chooses its regularization function based on the loss functions observed so far, and proves competitive guarantees showing that the algorithm's bound is within a constant factor of the best possible bound in hindsight.