• Corpus ID: 235658735

The Convergence Rate of SGD's Final Iterate: Analysis on Dimension Dependence

Daogao Liu, Zhou Lu
Stochastic Gradient Descent (SGD) is among the simplest and most popular methods in optimization. The convergence rate for SGD has been extensively studied and tight analyses have been established for the running average scheme, but the sub-optimality of the final iterate is still not well understood. Shamir and Zhang [2013] gave the best known upper bound for the final iterate of SGD minimizing non-smooth convex functions, which is O(log T/√T) for Lipschitz convex functions and O(log T/T… 
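The distinction the abstract draws between the final iterate and the running average can be illustrated with a minimal sketch. Everything specific here is an assumption chosen for illustration and is not taken from the paper: the objective f(x) = |x|, the domain [-1, 1], the step size 1/sqrt(t), the Gaussian gradient noise, and the function name `sgd_abs`.

```python
import math
import random


def sgd_abs(T, seed=0, noise=1.0, x0=1.0):
    """Run T steps of projected SGD on f(x) = |x| over [-1, 1].

    Illustrative assumptions: step size 1/sqrt(t), Gaussian gradient noise.
    Returns (final iterate x_T, running average of x_1..x_T).
    """
    rng = random.Random(seed)
    x = x0
    total = 0.0
    for t in range(1, T + 1):
        # Subgradient of |x| (0 is chosen at the kink) plus zero-mean noise.
        g = (x > 0) - (x < 0) + rng.gauss(0.0, noise)
        # Gradient step with step size 1/sqrt(t), then projection onto [-1, 1].
        x = max(-1.0, min(1.0, x - g / math.sqrt(t)))
        total += x
    return x, total / T


if __name__ == "__main__":
    final, avg = sgd_abs(10_000, seed=0)
    # The minimum of f is 0, so f(x) = |x| is itself the sub-optimality.
    print("final-iterate error:", abs(final))
    print("running-average error:", abs(avg))
```

A single run like this only shows what the two quantities are; it says nothing about rates, which are statements about expectations or high-probability bounds over many runs and worst-case functions.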

Open Problem: Tight Convergence of SGD in Constant Dimension

A gap is pointed out between the known upper and lower bounds for the expected suboptimality of the last SGD point whenever the dimension is a constant independent of the number of SGD iterations; in particular, the gap is still unaddressed even in the one-dimensional case.

Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes

The performance of SGD without non-trivial smoothness assumptions is investigated, as well as a running average scheme to convert the SGD iterates to a solution with optimal optimization accuracy, and a new and simple averaging scheme is proposed which not only attains optimal rates, but can also be easily computed on-the-fly.

Tight Analyses for Non-Smooth Stochastic Gradient Descent

It is proved that after $T$ steps of stochastic gradient descent, the error of the final iterate is $O(\log(T)/T)$ with high probability, and that there exists a function from this class for which the error of the last iterate of deterministic gradient descent is $\Omega(\log(T)/\sqrt{T})$.

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

This paper investigates the optimality of SGD in a stochastic setting and shows that for smooth problems the algorithm attains the optimal O(1/T) rate; for non-smooth problems, however, the convergence rate with averaging might really be Ω(log(T)/T), and this is not just an artifact of the analysis.

Making the Last Iterate of SGD Information Theoretically Optimal

The main contribution of this work is to design new step size sequences that enjoy information theoretically optimal bounds on the suboptimality of SGD as well as GD, by designing a modification scheme that converts one sequence of step sizes to another so that the last point of SGD/GD with the modified sequence has the same suboptimality guarantees as the average of SGD/GD with the original sequence.

Minimizing finite sums with the stochastic average gradient

Numerical experiments indicate that the new SAG method often dramatically outperforms existing SG and deterministic gradient methods, and that the performance may be further improved through the use of non-uniform sampling strategies.

On the generalization ability of on-line learning algorithms

This paper proves tight data-dependent bounds for the risk of this hypothesis in terms of an easily computable statistic $M_n$ associated with the on-line performance of the ensemble, and obtains risk tail bounds for kernel perceptron algorithms in terms of the spectrum of the empirical kernel matrix.

Robust Stochastic Approximation Approach to Stochastic Programming

It is intended to demonstrate that a properly modified SA approach can be competitive with, and even significantly outperform, the SAA method for a certain class of convex stochastic problems.

Pegasos: primal estimated sub-gradient solver for SVM

A simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines, which is particularly well suited for large text classification problems, and demonstrates an order-of-magnitude speedup over previous SVM learning methods.

On the Generalization Ability of Online Strongly Convex Programming Algorithms

A sharp bound is established on the excess risk of the output of an online algorithm in terms of the average regret, which allows one to use recent algorithms with logarithmic cumulative regret guarantees to achieve fast convergence rates for the excess risk with high probability.