# Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

@article{Rakhlin2012MakingGD, title={Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization}, author={Alexander Rakhlin and Ohad Shamir and Karthik Sridharan}, journal={ArXiv}, year={2012}, volume={abs/1109.5647} }

Stochastic gradient descent (SGD) is a simple and popular method to solve stochastic optimization problems which arise in machine learning. For strongly convex problems, its convergence rate was known to be O(log(T)/T), by running SGD for T iterations and returning the average point. However, recent results showed that using a different algorithm, one can get an optimal O(1/T) rate. This might lead one to believe that standard SGD is suboptimal, and maybe should even be replaced as a method of…
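The setup the abstract describes can be sketched in a few lines: SGD on a λ-strongly convex objective with the standard step size ηₜ = 1/(λt), returning both the last iterate and the running average. This is a minimal illustrative sketch on a synthetic 1-D quadratic with additive gradient noise, not the paper's algorithm; the problem, noise model, and constants are assumptions for illustration.

```python
import random

def sgd_strongly_convex(T=10000, lam=1.0, seed=0):
    """SGD on F(w) = (lam/2) * w**2 with noisy gradients and step size 1/(lam*t).

    Returns the last iterate and the running average of all iterates.
    """
    rng = random.Random(seed)
    w = 1.0        # initial point
    avg = 0.0      # running average of w_1, ..., w_t
    for t in range(1, T + 1):
        g = lam * w + rng.gauss(0.0, 1.0)  # stochastic gradient: true grad + noise
        w -= g / (lam * t)                 # step size eta_t = 1/(lam*t)
        avg += (w - avg) / t               # online update of the average
    return w, avg

w_last, w_avg = sgd_strongly_convex()
# both estimates should land near the optimum w* = 0
```

The distinction the paper studies is exactly which of these two returned points (or a partial, "suffix" average between them) attains the optimal O(1/T) rate.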

## 567 Citations

Stochastic Gradient Descent for Non-smooth Optimization: Convergence Results and Optimal Averaging Schemes

- Computer Science, ICML
- 2013

The performance of SGD without non-trivial smoothness assumptions is investigated, along with running-average schemes that convert the SGD iterates into a solution with optimal optimization accuracy; a new and simple averaging scheme is proposed that not only attains optimal rates but can also be computed easily on the fly.
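One family of averaging schemes that "can be computed on the fly" is polynomial-decay averaging, which keeps only the current average and iterate while weighting later iterates more heavily. This sketch is an illustration of that general idea; the recursion form and the parameter name `eta` are assumptions, not a verbatim transcription of the cited paper's scheme.

```python
def polynomial_decay_average(iterates, eta=3.0):
    """Weighted running average that emphasizes later iterates.

    avg_t = (1 - rho_t) * avg_{t-1} + rho_t * w_t,  with rho_t = (eta+1)/(t+eta).
    Larger eta shifts more weight onto recent iterates; eta is a free
    parameter here, assumed for illustration. Only O(1) state is kept.
    """
    avg = None
    for t, w in enumerate(iterates, start=1):
        if avg is None:
            avg = w
        else:
            rho = (eta + 1.0) / (t + eta)
            avg = (1.0 - rho) * avg + rho * w
    return avg

flat = polynomial_decay_average([1.0] * 10)       # constant sequence -> 1.0
late = polynomial_decay_average([0.0] * 9 + [1.0])  # weights the late iterate
```

Unlike the uniform average, the final iterate of the sequence contributes a constant fraction of the output, which is what lets such schemes avoid the log(T) penalty of plain averaging.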

Open Problem: Is Averaging Needed for Strongly Convex Stochastic Gradient Descent?

- Computer Science, COLT
- 2012

The question is whether averaging is needed at all for stochastic gradient descent to attain optimal rates; the algorithm makes use of an oracle which returns a random vector whose expectation is a subgradient of F(w).

Stochastic Algorithm with Optimal Convergence Rate for Strongly Convex Optimization Problems

- Computer Science
- 2014

A weighted algorithm based on COMID is presented to preserve the sparsity imposed by the L1 regularization term, and a proof is provided showing that it achieves an O(1/T) convergence rate.

Efficient Stochastic Gradient Descent for Strongly Convex Optimization

- Computer Science, ArXiv
- 2013

An epoch-projection SGD method is presented that makes only about $\log_2 T$ projections in total yet achieves an optimal convergence rate for strongly convex optimization, together with a proximal extension that exploits the structure of the objective function to further speed up computation and convergence for sparse regularized loss minimization problems.

Problem Setup and Main Results: Consider the Following Optimization Problem

- Computer Science
- 2019

This work designs a modification scheme that converts one sequence of step sizes into another, so that the last point of SGD/GD with the modified sequence has the same suboptimality guarantees as the average of SGD/GD with the original sequence, and shows that this result holds with high probability.

Stochastic Learning via Optimizing the Variational Inequalities

- Computer Science, IEEE Transactions on Neural Networks and Learning Systems
- 2014

The proposed stochastic ADMM (SADMM) is proved to have an O(1/t) VI-convergence rate for l1-regularized hinge loss problems without strong convexity or smoothness, and a new VI criterion is defined to measure the convergence of stochastic algorithms.

Tight Analyses for Non-Smooth Stochastic Gradient Descent

- Computer Science, Mathematics, COLT
- 2019

It is proved that after $T$ steps of stochastic gradient descent, the error of the final iterate is $O(\log(T)/T)$ with high probability, and that there exists a function from this class for which the error of the final iterate of deterministic gradient descent is $\Omega(\log(T)/\sqrt{T})$.

Understanding the role of averaging in non-smooth stochastic gradient descent

- Computer Science, Mathematics
- 2020

It is proved that after T steps of stochastic gradient descent (SGD), the error of the final iterate is O(log(T)/T) with high probability, that there exists a function for which this bound is attained, and the results are proven using a generalization of Freedman's inequality.

Optimal Stochastic Strongly Convex Optimization with a Logarithmic Number of Projections

- Computer Science, UAI
- 2016

This work considers stochastic strongly convex optimization with a complex inequality constraint, and proposes an Epoch-Projection Stochastic Gradient Descent (Epro-SGD) method, along with a variant, Epro-ORDA, based on the optimal regularized dual averaging method.
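The epoch-projection idea in the two entries above can be sketched as follows: run SGD steps freely inside each (doubling) epoch and project onto the feasible set only at epoch boundaries, so only about log₂(T) projections are performed in total. This is a minimal sketch of that idea only; the epoch schedule, step size, toy problem, and the omission of Epro-SGD's in-epoch penalty handling of the constraint are all simplifying assumptions.

```python
import random

def epoch_projection_sgd(grad, project, w0, T=1024, lam=1.0):
    """SGD that projects onto the feasible set only at the end of each
    doubling epoch, giving roughly log2(T) projections in total."""
    w = w0
    t = 1
    epoch_len = 1
    projections = 0
    while t <= T:
        for _ in range(epoch_len):
            if t > T:
                break
            w = w - grad(w) / (lam * t)  # step size 1/(lam*t)
            t += 1
        w = project(w)                   # one projection per epoch
        projections += 1
        epoch_len *= 2                   # doubling epochs
    return w, projections

# toy run: minimize (w - 2)**2 / 2 over the interval [-1, 1] (optimum w* = 1)
rng = random.Random(0)
w_hat, n_proj = epoch_projection_sgd(
    grad=lambda w: (w - 2.0) + rng.gauss(0.0, 0.1),
    project=lambda w: max(-1.0, min(1.0, w)),
    w0=0.0,
)
```

The point of the construction is cost: when the projection is expensive (a complex inequality constraint), paying for it log₂(T) times instead of T times dominates the total running time.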

## References

SHOWING 1-10 OF 17 REFERENCES

Stochastic Convex Optimization

- Computer Science, COLT
- 2009

Stochastic convex optimization is studied, and it is shown that the key ingredient is strong convexity and regularization, which is a sufficient but not necessary condition for meaningful non-trivial learnability.

Robust Stochastic Approximation Approach to Stochastic Programming

- Computer Science, Mathematics, SIAM J. Optim.
- 2009

It is intended to demonstrate that a properly modified SA approach can be competitive and even significantly outperform the SAA method for a certain class of convex stochastic problems.

Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning

- Computer Science, Mathematics, NIPS
- 2011

This work provides a non-asymptotic analysis of the convergence of two well-known algorithms: stochastic gradient descent, and a simple modification in which the iterates are averaged. The analysis suggests that a learning rate proportional to the inverse of the number of iterations, while leading to the optimal convergence rate, is not robust to the lack of strong convexity or to the setting of the proportionality constant.

Pegasos: primal estimated sub-gradient solver for SVM

- Computer Science, ICML '07
- 2007

A simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines, which is particularly well suited for large text classification problems, and demonstrates an order-of-magnitude speedup over previous SVM learning methods.
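Pegasos is a concrete instance of the strongly convex setting discussed in this paper: the SVM objective (λ/2)‖w‖² + mean hinge loss is λ-strongly convex, and Pegasos applies a stochastic sub-gradient step with ηₜ = 1/(λt) to one random example per iteration. A minimal sketch of that update follows; the toy dataset, λ, and iteration count are assumptions, and the optional projection step of the full algorithm is omitted.

```python
import random

def pegasos(data, lam=0.1, T=1000, seed=0):
    """Pegasos-style stochastic sub-gradient descent for the linear SVM
    objective (lam/2)*||w||^2 + mean hinge loss, one random example per step."""
    rng = random.Random(seed)
    dim = len(data[0][0])
    w = [0.0] * dim
    for t in range(1, T + 1):
        x, y = rng.choice(data)
        eta = 1.0 / (lam * t)                 # step size 1/(lam*t)
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        # sub-gradient of the regularizer: shrink w toward zero
        w = [(1.0 - eta * lam) * wi for wi in w]
        # hinge term contributes only when the margin is violated
        if margin < 1.0:
            w = [wi + eta * y * xi for wi, xi in zip(w, x)]
    return w

# linearly separable toy data in 2-D
data = [((1.0, 1.0), 1), ((1.5, 0.5), 1), ((-1.0, -1.0), -1), ((-0.5, -1.5), -1)]
w = pegasos(data)
```

After training, every point in the toy set should satisfy y·⟨w, x⟩ > 0, i.e., lie on the correct side of the learned hyperplane.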

Primal-dual subgradient methods for minimizing uniformly convex functions

- Mathematics, Computer Science
- 2010

Accuracy bounds for the performance of non-Euclidean deterministic and stochastic algorithms and design methods which are adaptive with respect to the parameters of strong or uniform convexity of the objective are provided.

High-Probability Regret Bounds for Bandit Online Linear Optimization

- Computer Science, Mathematics, COLT
- 2008

This paper eliminates the gap between the high-probability bounds obtained in the full-information and bandit settings, and improves on the previous algorithm [8], whose regret is bounded only in expectation against an oblivious adversary.

A General Class of Exponential Inequalities for Martingales and Ratios

- Mathematics
- 1999

In this paper we introduce a technique for obtaining exponential inequalities, with particular emphasis placed on results involving ratios. Our main applications consist of approximations to the tail…

Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization

- Computer Science, COLT
- 2011

An algorithm is given that performs only gradient updates yet achieves the optimal rate of convergence for stochastic convex optimization with a strongly convex objective.

Stochastic Approximation and Recursive Algorithms and Applications

- Mathematics
- 2003

Introduction 1 Review of Continuous Time Models 1.1 Martingales and Martingale Inequalities 1.2 Stochastic Integration 1.3 Stochastic Differential Equations: Diffusions 1.4 Reflected Diffusions 1.5…

Training linear SVMs in linear time

- Computer Science, KDD '06
- 2006

A cutting-plane algorithm for training linear SVMs that provably has training time O(sn) for classification problems and O(sn log(n)) for ordinal regression problems, and is several orders of magnitude faster than decomposition methods like SVMlight for large datasets.