Corpus ID: 231925231

On the Last Iterate Convergence of Momentum Methods

Xiaoyun Li, Mingrui Liu, and Francesco Orabona. "On the Last Iterate Convergence of Momentum Methods." In International Conference on Algorithmic Learning Theory.

SGD with Momentum (SGDM) is a widely used family of algorithms for large-scale optimization of machine learning problems. Yet, when optimizing generic convex functions, no advantage is known for any SGDM algorithm over plain SGD. Moreover, even the most recent results require changes to the SGDM algorithm, such as averaging of the iterates and a projection onto a bounded domain, which are rarely used in practice. In this paper, we focus on the convergence rate of the last iterate of SGDM. For the…
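The SGDM update discussed in the abstract can be sketched as follows. This is a minimal heavy-ball-style implementation, not the paper's exact algorithm; the function names and hyperparameter values are illustrative:

```python
import numpy as np

def sgdm(grad, x0, lr=0.1, beta=0.9, steps=200):
    """Minimal SGD with momentum (heavy-ball form), a sketch.

    m_t = beta * m_{t-1} + g_t
    x_{t+1} = x_t - lr * m_t
    where g_t is a (stochastic) gradient evaluated at x_t.
    Returns the last iterate, with no averaging or projection.
    """
    x = np.asarray(x0, dtype=float)
    m = np.zeros_like(x)
    for _ in range(steps):
        g = grad(x)       # stochastic gradient oracle
        m = beta * m + g  # momentum buffer
        x = x - lr * m    # plain last-iterate update
    return x
```

For example, on the convex quadratic f(x) = ||x||², with grad(x) = 2x, the last iterate approaches the minimizer at the origin.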


Momentum Provably Improves Error Feedback!

  • Ilyas Fatkhullin, Alexander Tyurin, Peter Richtárik (Computer Science, 2023)
A surprisingly simple fix is proposed which removes this issue both theoretically and in practice: the application of Polyak's momentum to the latest incarnation of error feedback (EF) due to Richtárik et al.

High Probability Guarantees for Nonconvex Stochastic Gradient Descent with Heavy Tails

This paper develops high probability bounds for nonconvex SGD with a joint perspective of optimization and generalization performance, and shows that gradient clipping can be employed to remove the bounded gradient-type assumptions.

Anytime Online-to-Batch, Optimism and Acceleration

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

This work describes and analyzes an apparatus for adaptively modifying the proximal function, which significantly simplifies setting a learning rate and yields regret guarantees provably as good as those of the best proximal function chosen in hindsight.

Understanding the Role of Momentum in Non-Convex Optimization: Practical Insights from a Lyapunov Analysis

A Lyapunov analysis of SGD with momentum (SGD+M) that utilizes an equivalent rewriting of the method, known as the stochastic primal averaging (SPA) form, yielding bounds much tighter than previous theory in the non-convex case.
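The SPA rewriting referenced in this summary can be illustrated numerically. Under the standard heavy-ball/primal-averaging correspondence (momentum β = 1 − c, step size lr = c·η), the two update rules generate identical iterates; the code below is a sketch with illustrative names and constants, not the cited paper's implementation:

```python
import numpy as np

def heavy_ball(grad, x0, lr, beta, steps):
    # SGDM in heavy-ball form: x_{t+1} = x_t + beta*(x_t - x_{t-1}) - lr*g_t
    x_prev = x = np.asarray(x0, dtype=float)
    xs = [x]
    for _ in range(steps):
        x_next = x + beta * (x - x_prev) - lr * grad(x)
        x_prev, x = x, x_next
        xs.append(x)
    return xs

def primal_averaging(grad, x0, eta, c, steps):
    # SPA form: a subgradient step on z, then an online average into x
    #   z_{t+1} = z_t - eta * g_t
    #   x_{t+1} = (1 - c) * x_t + c * z_{t+1}
    x = z = np.asarray(x0, dtype=float)
    xs = [x]
    for _ in range(steps):
        z = z - eta * grad(x)
        x = (1 - c) * x + c * z
        xs.append(x)
    return xs
```

Running both on f(x) = ||x||² with the parameter mapping above produces matching trajectories, which is the equivalence the Lyapunov analysis exploits.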

Problem Complexity and Method Efficiency in Optimization

Primal Averaging: A New Gradient Evaluation Step to Attain the Optimal Individual Convergence

It is proved that simply replacing the gradient evaluation step in mirror descent (MD) with the primal averaging (PA) strategy suffices to recover the optimal individual rate for general convex problems; the approach is extended to regularized nonsmooth learning problems in the stochastic setting, revealing that the PA strategy is a simple yet effective extra step toward the optimal individual convergence of SGD.

Quasi-monotone Subgradient Methods for Nonsmooth Convex Minimization

These methods guarantee the best possible rate of convergence for the whole sequence of test points and are applicable as efficient real-time stabilization tools for potential systems with infinite horizon.

An optimal method for stochastic composite optimization

The accelerated stochastic approximation (AC-SA) algorithm, based on Nesterov's optimal method for smooth convex programming, is introduced, and it is shown that AC-SA achieves the lower bound on the rate of convergence for stochastic composite optimization (SCO).

Competing in the Dark: An Efficient Algorithm for Bandit Linear Optimization

This work introduces an efficient algorithm for online linear optimization in the bandit setting that achieves the optimal O*(√T) regret, and presents a novel connection between online learning and interior-point methods.

Online learning: theory, algorithms and applications

This dissertation describes a novel framework for the design and analysis of online learning algorithms and proposes a new perspective on regret bounds which is based on the notion of duality in convex optimization.

Adaptive Bound Optimization for Online Convex Optimization

This work introduces a new online convex optimization algorithm that adaptively chooses its regularization function based on the loss functions observed so far, and proves competitive guarantees showing the algorithm's bound is within a constant factor of the best possible bound in hindsight.