# Pegasos: primal estimated sub-gradient solver for SVM

@article{ShalevShwartz2007PegasosPE,
title={Pegasos: primal estimated sub-gradient solver for SVM},
author={Shai Shalev-Shwartz and Yoram Singer and Nathan Srebro and Andrew Cotter},
journal={Mathematical Programming},
year={2007},
volume={127},
pages={3-30}
}
• Published 20 June 2007
• Computer Science
• Mathematical Programming
We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy $\epsilon$ is $\tilde{O}(1 / \epsilon)$, where each iteration operates on a single training example. In contrast, previous analyses of stochastic gradient descent methods for SVMs require $\Omega(1 / \epsilon^2)$ iterations. As in previously…
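The abstract does not spell out the update rule, so the following is only a minimal sketch of a Pegasos-style solver under stated assumptions: a linear SVM without a bias term, the objective $\frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_i \max(0, 1 - y_i \langle w, x_i \rangle)$, and the step size $\eta_t = 1/(\lambda t)$. The names `pegasos_train`, `predict`, `lam`, and `n_iters`, as well as the toy data, are illustrative and not taken from the paper.

```python
import random


def pegasos_train(samples, labels, lam=0.1, n_iters=2000, seed=0):
    """Sketch of a stochastic sub-gradient solver for a linear SVM.

    Assumed objective: (lam / 2) * ||w||^2 + mean(max(0, 1 - y * <w, x>)).
    Labels are expected to be +1 or -1; no bias term is learned.
    """
    rng = random.Random(seed)
    w = [0.0] * len(samples[0])
    for t in range(1, n_iters + 1):
        # Each iteration looks at a single, uniformly chosen training example.
        i = rng.randrange(len(samples))
        x, y = samples[i], labels[i]
        eta = 1.0 / (lam * t)  # assumed step-size schedule
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        # Sub-gradient step on the regularizer: shrink w.
        w = [(1.0 - eta * lam) * wj for wj in w]
        # Sub-gradient step on the hinge loss, only if the margin is violated.
        if margin < 1.0:
            w = [wj + eta * y * xj for wj, xj in zip(w, x)]
    return w


def predict(w, x):
    """Return +1 or -1 according to the sign of <w, x>."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1


# Toy data that is linearly separable through the origin (illustrative only).
X = [[2.0, 2.0], [3.0, 1.0], [1.0, 3.0], [-2.0, -2.0], [-3.0, -1.0], [-1.0, -3.0]]
y = [1, 1, 1, -1, -1, -1]
w = pegasos_train(X, y)
predictions = [predict(w, x) for x in X]
```

The optional projection of `w` onto a ball of radius $1/\sqrt{\lambda}$, which some descriptions of the method include, is omitted here to keep the sketch short.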
## 2,216 Citations

### SVM via Saddle Point Optimization: New Bounds and Distributed Algorithms

• Computer Science
SWAT
• 2018
The first nearly linear time algorithm for $\nu$-SVM is provided, which improves the running time by a factor of $\sqrt{d}/\sqrt{\epsilon}$.

### NESVM: A Fast Gradient Method for Support Vector Machines

• Computer Science
2010 IEEE International Conference on Data Mining
• 2010
NESVM, a fast gradient SVM solver that can optimize various SVM models (e.g., classical SVM, linear programming SVM, and least square SVM), is presented, and the efficiency and effectiveness of NESVM are suggested.

### Streaming Complexity of SVMs

• Computer Science
APPROX-RANDOM
• 2020
It is shown that, for both problems, for dimensions $d=1,2$, one can obtain streaming algorithms with space polynomially smaller than SGD for strongly convex functions like the bias-regularized SVM, and polynomial lower bounds for both point estimation and optimization are proved.

### Stochastic Proximal Gradient Algorithm with Minibatches. Application to Large Scale Learning Models

• Computer Science
ArXiv
• 2020
This chapter develops and analyzes minibatch variants of stochastic proximal gradient algorithm for general composite objective functions with stochastically nonsmooth components and provides iteration complexity for constant and variable stepsize policies.

### On the Complexity of Minimizing Convex Finite Sums Without Using the Indices of the Individual Functions

• Computer Science, Mathematics
ArXiv
• 2020
This work exploits the finite noise structure of finite sums to derive a matching $O(n^2)$-upper bound under the global oracle model, showing that this lower bound is indeed tight.

### Weighted SGD for $\ell_p$ Regression with Randomized Preconditioning

• Computer Science
J. Mach. Learn. Res.
• 2017
It is proved that pwSGD inherits faster convergence rates that only depend on the lower dimension of the linear system, while maintaining low computation complexity, which is uniformly better than that of RLA methods in terms of both $\epsilon$ and $d$ when the problem is unconstrained.

### The Stochastic Gradient Descent for the Primal L1-SVM Optimization Revisited

• Computer Science
ECML/PKDD
• 2013
The stochastic (sub)gradient approach to the unconstrained primal L1-SVM optimization is reconsidered and a mechanism of presenting the same pattern repeatedly to the algorithm which maintains the above properties is proposed.

### Strong error analysis for stochastic gradient descent optimization algorithms

• Computer Science, Mathematics
• 2018
A rigorous strong error analysis for SGD optimization algorithms is performed, and it is proved that, for every arbitrarily small $\varepsilon$ and every arbitrarily large $p\in (0,\infty)$, the considered SGD optimization algorithm converges in the strong $L^p$-sense.

### Making the Last Iterate of SGD Information Theoretically Optimal

• Computer Science
COLT
• 2019
The main contribution of this work is to design new step size sequences that enjoy information theoretically optimal bounds on the suboptimality of SGD as well as GD, by designing a modification scheme that converts one sequence of step sizes to another so that the last point of SGD/GD with the modified sequence has the same suboptimality guarantees as the average of SGD/GD with the original sequence.

### Optimal Finite-Sum Smooth Non-Convex Optimization with SARAH

• Computer Science, Mathematics
ArXiv
• 2019
This paper is the first to show that this lower bound is tight for the class of variance reduction methods which only assume the Lipschitz continuous gradient assumption, and proposes SARAH++ with sublinear convergence for general convex and linear convergence for strongly convex problems.

## References

### QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines

• Computer Science, Mathematics
J. Mach. Learn. Res.
• 2006
Operational conditions for which the Simon and composite algorithms possess an upper bound of O(n) on the number of iterations are described, and general conditions for which a matching lower bound exists for any decomposition algorithm that uses working sets of size 2 are described.

### Proximal regularization for online and batch learning

• Computer Science
ICML '09
• 2009
Proximal regularization is employed, in which the original learning problem is solved via a sequence of modified optimization tasks whose objectives are chosen to have greater curvature than the original problem.

### Fast training of support vector machines using sequential minimal optimization

Advances in Kernel Methods
SMO breaks this large quadratic programming problem into a series of smallest possible QP problems, which avoids using a time-consuming numerical QP optimization as an inner loop and hence SMO is fastest for linear SVMs and sparse data sets.

### Logarithmic regret algorithms for online convex optimization

• Computer Science
Machine Learning
• 2007
Several algorithms achieving logarithmic regret are proposed, which besides being more general are also much more efficient to implement, and give rise to an efficient algorithm based on the Newton method for optimization, a new tool in the field.

### Online learning with kernels

• Computer Science
IEEE Transactions on Signal Processing
• 2004
This paper considers online learning in a reproducing kernel Hilbert space, and allows the exploitation of the kernel trick in an online setting, and examines the value of large margins for classification in the online setting with a drifting target.

### Primal-dual subgradient methods for convex problems

• Y. Nesterov
• Mathematics, Computer Science
Math. Program.
• 2009
A new approach is proposed for constructing subgradient schemes for different types of nonsmooth problems with convex structure; the schemes are primal-dual, since they are always able to generate a feasible approximation to the optimum of an appropriately formulated dual problem.

### On the generalization ability of on-line learning algorithms

• Computer Science
IEEE Transactions on Information Theory
• 2004
This paper proves tight data-dependent bounds for the risk of this hypothesis in terms of an easily computable statistic $M_n$ associated with the on-line performance of the ensemble, and obtains risk tail bounds for kernel perceptron algorithms in terms of the spectrum of the empirical kernel matrix.

### Efficient SVM Training Using Low-Rank Kernel Representations

• Computer Science
J. Mach. Learn. Res.
• 2001
This work shows that for a low rank kernel matrix it is possible to design a better interior point method (IPM) in terms of storage requirements as well as computational complexity and derives an upper bound on the change in the objective function value based on the approximation error and the number of active constraints (support vectors).

### A dual coordinate descent method for large-scale linear SVM

• Computer Science
ICML '08
• 2008
A novel dual coordinate descent method for linear SVM with L1- and L2-loss functions that reaches an $\epsilon$-accurate solution in $O(\log(1/\epsilon))$ iterations is presented.

### Fast Rates for Regularized Objectives

• Mathematics
NIPS
• 2008
It is shown that the value attained by the empirical minimizer converges to the optimal value with rate 1/n, which is essential for obtaining certain types of oracle inequalities for SVMs.