Pegasos: primal estimated sub-gradient solver for SVM

Shai Shalev-Shwartz, Yoram Singer, Nathan Srebro, Andrew Cotter
Mathematical Programming
We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy $\epsilon$ is $\tilde{O}(1 / \epsilon)$, where each iteration operates on a single training example. In contrast, previous analyses of stochastic gradient descent methods for SVMs require $\Omega(1 / \epsilon^2)$ iterations. As in previously…
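
The per-example update the abstract describes can be illustrated with a minimal sketch. It assumes the usual linear SVM objective $\frac{\lambda}{2}\|w\|^2$ plus the average hinge loss, a step size of $1/(\lambda t)$, no bias term, and no projection step. The names `pegasos_train` and `predict`, the defaults, and the seed are illustrative choices that do not come from the text above, so treat this as a sketch under those assumptions rather than the authors' implementation.

```python
import random


def pegasos_train(examples, lam=0.1, num_iters=2000, seed=0):
    """Sketch of a Pegasos-style stochastic sub-gradient solver (linear SVM).

    examples:  list of (x, y) pairs; x is a list of floats, y is -1 or +1.
    lam:       regularization parameter lambda (assumed objective:
               lam/2 * ||w||^2 + average hinge loss).
    num_iters: number of single-example iterations (hypothetical default).
    """
    rng = random.Random(seed)
    dim = len(examples[0][0])
    w = [0.0] * dim
    for t in range(1, num_iters + 1):
        # Each iteration looks at exactly one randomly chosen training example.
        x, y = rng.choice(examples)
        eta = 1.0 / (lam * t)  # assumed step-size schedule 1 / (lambda * t)
        margin = y * sum(wj * xj for wj, xj in zip(w, x))
        shrink = 1.0 - eta * lam  # effect of the regularization term
        if margin < 1.0:
            # Hinge loss is active: shrink w and move toward y * x.
            w = [shrink * wj + eta * y * xj for wj, xj in zip(w, x)]
        else:
            # Hinge loss is inactive: only the regularizer contributes.
            w = [shrink * wj for wj in w]
    return w


def predict(w, x):
    """Return the predicted label (+1 or -1) for a feature vector x."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) >= 0 else -1
```

Calling `pegasos_train` on a small list of `(x, y)` pairs returns a weight vector that `predict` can apply to new points. The cost of each iteration depends only on the dimension of a single example, not on the size of the training set.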

SVM via Saddle Point Optimization: New Bounds and Distributed Algorithms

The first nearly linear time algorithm for $\nu$-SVM is provided, which improves the running time by a factor of $\sqrt{d}/\sqrt{\epsilon}$.

NESVM: A Fast Gradient Method for Support Vector Machines

NESVM, a fast gradient SVM solver that can optimize various SVM models (e.g., classical SVM, linear programming SVM, and least squares SVM), is presented, along with evidence of its efficiency and effectiveness.

Streaming Complexity of SVMs

It is shown that, for both problems, for dimensions $d=1,2$, one can obtain streaming algorithms with space polynomially smaller than SGD for strongly convex functions like the bias-regularized SVM, and polynomial lower bounds for both point estimation and optimization are proved.

Stochastic Proximal Gradient Algorithm with Minibatches. Application to Large Scale Learning Models

This chapter develops and analyzes minibatch variants of the stochastic proximal gradient algorithm for general composite objective functions with stochastically nonsmooth components, and provides iteration complexity results for constant and variable stepsize policies.

On the Complexity of Minimizing Convex Finite Sums Without Using the Indices of the Individual Functions

This work exploits the finite noise structure of finite sums to derive a matching $O(n^2)$-upper bound under the global oracle model, showing that this lower bound is indeed tight.

Weighted SGD for $\ell_p$ Regression with Randomized Preconditioning

It is proved that pwSGD inherits faster convergence rates that only depend on the lower dimension of the linear system, while maintaining low computation complexity, which is uniformly better than that of RLA methods in terms of both $\epsilon$ and $d$ when the problem is unconstrained.

The Stochastic Gradient Descent for the Primal L1-SVM Optimization Revisited

The stochastic (sub)gradient approach to the unconstrained primal L1-SVM optimization is reconsidered and a mechanism of presenting the same pattern repeatedly to the algorithm which maintains the above properties is proposed.

Strong error analysis for stochastic gradient descent optimization algorithms

A rigorous strong error analysis for SGD optimization algorithms is performed, and it is proved that, for every arbitrarily small $\varepsilon$ and every arbitrarily large $p\in (0,\infty)$, the considered SGD optimization algorithm converges in the strong $L^p$-sense.

Making the Last Iterate of SGD Information Theoretically Optimal

The main contribution of this work is to design new step size sequences that enjoy information theoretically optimal bounds on the suboptimality of SGD as well as GD, by designing a modification scheme that converts one sequence of step sizes to another so that the last point of SGD/GD with the modified sequence has the same suboptimality guarantees as the average of SGD/GD with the original sequence.

Optimal Finite-Sum Smooth Non-Convex Optimization with SARAH

This paper is the first to show that this lower bound is tight for the class of variance reduction methods which only assume the Lipschitz continuous gradient assumption, and proposes SARAH++ with sublinear convergence for general convex and linear convergence for strongly convex problems.



QP Algorithms with Guaranteed Accuracy and Run Time for Support Vector Machines

Operational conditions for which the Simon and composite algorithms possess an upper bound of O(n) on the number of iterations are described, and general conditions for which a matching lower bound exists for any decomposition algorithm that uses working sets of size 2 are described.

Proximal regularization for online and batch learning

Proximal regularization is employed, in which the original learning problem is solved via a sequence of modified optimization tasks whose objectives are chosen to have greater curvature than the original problem.

Fast training of support vector machines using sequential minimal optimization, advances in kernel methods

SMO breaks this large quadratic programming problem into a series of smallest possible QP problems, which avoids using a time-consuming numerical QP optimization as an inner loop and hence SMO is fastest for linear SVMs and sparse data sets.

Logarithmic regret algorithms for online convex optimization

Several algorithms achieving logarithmic regret are proposed, which besides being more general are also much more efficient to implement, and give rise to an efficient algorithm based on the Newton method for optimization, a new tool in the field.

Online learning with kernels

This paper considers online learning in a reproducing kernel Hilbert space, and allows the exploitation of the kernel trick in an online setting, and examines the value of large margins for classification in the online setting with a drifting target.

Primal-dual subgradient methods for convex problems

Y. Nesterov, Math. Program., 2009
A new approach is proposed for constructing subgradient schemes for different types of nonsmooth problems with convex structure; the schemes are primal-dual in that they can always generate a feasible approximation to the optimum of an appropriately formulated dual problem.

On the generalization ability of on-line learning algorithms

This paper proves tight data-dependent bounds for the risk of this hypothesis in terms of an easily computable statistic $M_n$ associated with the on-line performance of the ensemble, and obtains risk tail bounds for kernel perceptron algorithms in terms of the spectrum of the empirical kernel matrix.

Efficient SVM Training Using Low-Rank Kernel Representations

This work shows that for a low rank kernel matrix it is possible to design a better interior point method (IPM) in terms of storage requirements as well as computational complexity and derives an upper bound on the change in the objective function value based on the approximation error and the number of active constraints (support vectors).

A dual coordinate descent method for large-scale linear SVM

A novel dual coordinate descent method for linear SVM with L1- and L2-loss functions that reaches an ε-accurate solution in O(log(1/ε)) iterations is presented.

Fast Rates for Regularized Objectives

It is shown that the value attained by the empirical minimizer converges to the optimal value with rate 1/n, which is essential for obtaining certain types of oracle inequalities for SVMs.