# Pegasos: primal estimated sub-gradient solver for SVM

@article{ShalevShwartz2011PegasosPE, title={Pegasos: primal estimated sub-gradient solver for SVM}, author={Shai Shalev-Shwartz and Yoram Singer and Nathan Srebro and Andrew Cotter}, journal={Mathematical Programming}, year={2011}, volume={127}, pages={3-30} }

We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy $\epsilon$ is $\tilde{O}(1/\epsilon)$, where each iteration operates on a single training example. In contrast, previous analyses of stochastic gradient descent methods for SVMs require $\Omega(1/\epsilon^2)$ iterations. As in previously…
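The update described in the abstract can be sketched in a few lines: at each step, draw one training example, take a sub-gradient step on the regularized hinge loss with step size $\eta_t = 1/(\lambda t)$, and optionally project onto the ball of radius $1/\sqrt{\lambda}$. This is a minimal illustrative sketch, not the authors' reference implementation; the function name and defaults are ours.

```python
import numpy as np

def pegasos(X, y, lam=0.01, T=1000, seed=0):
    """Minimal Pegasos sketch for the primal SVM objective
    min_w (lam/2)||w||^2 + (1/n) sum_i max(0, 1 - y_i <w, x_i>).
    Each iteration touches a single randomly chosen example."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, T + 1):
        i = rng.integers(n)
        eta = 1.0 / (lam * t)          # step size eta_t = 1/(lam * t)
        margin = y[i] * (X[i] @ w)
        w *= (1.0 - eta * lam)         # gradient of the regularizer
        if margin < 1:                 # sub-gradient of the hinge loss
            w += eta * y[i] * X[i]
        # optional projection onto the ball of radius 1/sqrt(lam)
        radius = 1.0 / np.sqrt(lam)
        norm = np.linalg.norm(w)
        if norm > radius:
            w *= radius / norm
    return w
```

On linearly separable data a few thousand single-example iterations already recover an accurate separator, which is exactly the $\tilde{O}(1/\epsilon)$ behavior the paper proves.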

#### 1,904 Citations

SVM via Saddle Point Optimization: New Bounds and Distributed Algorithms

- Mathematics, Computer Science
- SWAT
- 2018

The first nearly linear time algorithm for $\nu$-SVM is provided, which improves the running time by a factor of $\sqrt{d}/\sqrt{\epsilon}$.

Streaming Complexity of SVMs

- Computer Science, Mathematics
- APPROX-RANDOM
- 2020

It is shown that, for both problems in dimensions $d=1,2$, one can obtain streaming algorithms with space polynomially smaller than that of SGD for strongly convex functions like the bias-regularized SVM; polynomial lower bounds for both point estimation and optimization are also proved.

Stochastic Proximal Gradient Algorithm with Minibatches. Application to Large Scale Learning Models

- Computer Science, Mathematics
- ArXiv
- 2020

This chapter develops and analyzes minibatch variants of the stochastic proximal gradient algorithm for general composite objective functions with stochastic nonsmooth components, and provides iteration-complexity results for constant and variable stepsize policies.

On the Complexity of Minimizing Convex Finite Sums Without Using the Indices of the Individual Functions

- Computer Science, Mathematics
- ArXiv
- 2020

This work exploits the finite noise structure of finite sums to derive a matching $O(n^2)$ upper bound under the global oracle model, showing that this lower bound is indeed tight.

The Stochastic Gradient Descent for the Primal L1-SVM Optimization Revisited

- Computer Science, Mathematics
- ECML/PKDD
- 2013

The stochastic (sub)gradient approach to the unconstrained primal L1-SVM optimization is reconsidered, and a mechanism for presenting the same pattern repeatedly to the algorithm while maintaining these properties is proposed.

Making the Last Iterate of SGD Information Theoretically Optimal

- Computer Science, Mathematics
- COLT
- 2019

The main contribution of this work is to design new step-size sequences that enjoy information-theoretically optimal bounds on the suboptimality of SGD as well as GD, by designing a modification scheme that converts one sequence of step sizes into another, so that the last point of SGD/GD with the modified sequence has the same suboptimality guarantees as the average of SGD/GD with the original sequence.

Strong error analysis for stochastic gradient descent optimization algorithms

- Mathematics
- 2018

Stochastic gradient descent (SGD) optimization algorithms are key ingredients in a series of machine learning applications. In this article we perform a rigorous strong error analysis for SGD…

Optimal Finite-Sum Smooth Non-Convex Optimization with SARAH

- Computer Science, Mathematics
- ArXiv
- 2019

This paper is the first to show that this lower bound is tight for the class of variance-reduction methods that assume only Lipschitz-continuous gradients, and it proposes SARAH++, with sublinear convergence for general convex and linear convergence for strongly convex problems.

The Lingering of Gradients: How to Reuse Gradients Over Time

- Computer Science, Mathematics
- NeurIPS
- 2018

This paper studies a more refined complexity by taking into account the "lingering" of gradients: once a gradient is computed at $x_k$, the additional time to compute gradients near $x_k$ may be reduced, which improves the running time of gradient descent and SVRG.

The Lingering of Gradients: Theory and Applications

- Mathematics
- 2019

Classically, the time complexity of a first-order method is estimated by its number of gradient computations. In this paper, we study a more refined complexity by taking into account the "lingering"…

#### References

Showing 1–10 of 36 references

Pegasos: Primal Estimated sub-GrAdient SOlver for SVM

- Mathematics, Computer Science
- ICML '07
- 2007

A simple and effective iterative algorithm for solving the optimization problem cast by Support Vector Machines that alternates between stochastic gradient descent steps and projection steps, and that can seamlessly be adapted to employ non-linear kernels while working solely on the primal objective function.

QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines

- Mathematics, Computer Science
- J. Mach. Learn. Res.
- 2006

Operational conditions under which the Simon and composite algorithms possess an upper bound of O(n) on the number of iterations are described, as are general conditions under which a matching lower bound exists for any decomposition algorithm that uses working sets of size 2.

Proximal regularization for online and batch learning

- Mathematics, Computer Science
- ICML '09
- 2009

Proximal regularization is employed, in which the original learning problem is solved via a sequence of modified optimization tasks whose objectives are chosen to have greater curvature than the original problem.

Fast training of support vector machines using sequential minimal optimization, advances in kernel methods

- Mathematics, Computer Science
- 1999

SMO breaks the large quadratic programming problem arising in SVM training into a series of smallest-possible QP problems, which avoids using a time-consuming numerical QP optimization as an inner loop; hence SMO is fastest for linear SVMs and sparse data sets.

Logarithmic Regret Algorithms for Online Convex Optimization

- Computer Science
- COLT
- 2006

This paper proposes several algorithms achieving logarithmic regret, which besides being more general are also much more efficient to implement, and gives an efficient algorithm based on the Newton method for optimization, a new tool in the field.

Online learning with kernels

- Mathematics, Computer Science
- IEEE Transactions on Signal Processing
- 2004

This paper considers online learning in a reproducing kernel Hilbert space, allowing exploitation of the kernel trick in an online setting, and examines the value of large margins for classification in the online setting with a drifting target.

Primal-dual subgradient methods for convex problems

- Computer Science, Mathematics
- Math. Program.
- 2009

A new approach for constructing subgradient schemes for different types of nonsmooth problems with convex structure; the schemes are primal-dual, since they always generate a feasible approximation to the optimum of an appropriately formulated dual problem.

On the generalization ability of on-line learning algorithms

- Computer Science, Mathematics
- IEEE Transactions on Information Theory
- 2004

This paper proves tight data-dependent bounds for the risk of this hypothesis in terms of an easily computable statistic $M_n$ associated with the on-line performance of the ensemble, and obtains risk tail bounds for kernel perceptron algorithms in terms of the spectrum of the empirical kernel matrix.

Efficient SVM Training Using Low-Rank Kernel Representations

- Computer Science, Mathematics
- J. Mach. Learn. Res.
- 2001

This work shows that for a low-rank kernel matrix it is possible to design a better interior point method (IPM) in terms of storage requirements as well as computational complexity, and derives an upper bound on the change in the objective function value based on the approximation error and the number of active constraints (support vectors).

A dual coordinate descent method for large-scale linear SVM

- Computer Science
- ICML '08
- 2008

A novel dual coordinate descent method for linear SVM with L1- and L2-loss functions that reaches an ε-accurate solution in O(log(1/ε)) iterations is presented.
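The dual coordinate descent idea summarized above can be sketched compactly for the L1-loss case: maintain $w = \sum_i \alpha_i y_i x_i$ and update one dual variable $\alpha_i$ at a time with a clipped Newton step. This is an illustrative sketch of the technique under standard assumptions (no zero training vectors), not the paper's reference code; the function name and defaults are ours.

```python
import numpy as np

def dcd_linear_svm(X, y, C=1.0, epochs=10, seed=0):
    """Dual coordinate descent sketch for the L1-loss linear SVM dual:
    min_alpha (1/2)||sum_i alpha_i y_i x_i||^2 - sum_i alpha_i,
    subject to 0 <= alpha_i <= C. Maintains w = sum_i alpha_i y_i x_i."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    alpha = np.zeros(n)
    w = np.zeros(d)
    Qii = np.einsum('ij,ij->i', X, X)  # diagonal of the Gram matrix
    for _ in range(epochs):
        for i in rng.permutation(n):
            G = y[i] * (X[i] @ w) - 1.0              # partial derivative w.r.t. alpha_i
            new_alpha = min(max(alpha[i] - G / Qii[i], 0.0), C)  # clipped coordinate step
            w += (new_alpha - alpha[i]) * y[i] * X[i]  # keep w in sync with alpha
            alpha[i] = new_alpha
    return w
```

Because each coordinate update touches only one example and maintains $w$ incrementally, a full epoch costs O(n·d), which is what makes the method attractive for large-scale linear SVMs.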