Corpus ID: 703528

Scalable Kernel Methods via Doubly Stochastic Gradients

@article{Dai2014ScalableKM,
  title={Scalable Kernel Methods via Doubly Stochastic Gradients},
  author={Bo Dai and Bo Xie and Niao He and Yingyu Liang and Anant Raj and Maria-Florina Balcan and Le Song},
  journal={ArXiv},
  year={2014},
  volume={abs/1407.5599}
}
The general perception is that kernel methods are not scalable, so neural nets become the choice for large-scale nonlinear learning problems. Have we tried hard enough for kernel methods? In this paper, we propose an approach that scales up kernel methods using a novel concept called "doubly stochastic functional gradients". Based on the fact that many kernel methods can be expressed as convex optimization problems, our approach solves the optimization problems by making two unbiased stochastic… 
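
As a concrete illustration of the two sources of randomness, here is a minimal sketch of doubly stochastic functional gradient descent, assuming a Gaussian RBF kernel, squared loss, and random Fourier features as the kernel's random-feature decomposition; the function name, constants, and step-size schedule are illustrative, not the authors' reference implementation.

    import numpy as np

    def doubly_sgd_rbf(X, y, T=2000, sigma=1.0, reg=1e-4, step=0.5, rng=None):
        """Sketch of doubly stochastic functional gradients (Gaussian kernel, squared loss)."""
        rng = np.random.default_rng(rng)
        n, d = X.shape
        omegas = np.zeros((T, d))   # frequencies omega_t ~ N(0, I / sigma^2)
        phases = np.zeros(T)        # phases b_t ~ Uniform[0, 2*pi]
        alphas = np.zeros(T)        # coefficients of the functional expansion

        def feats(x, t):
            # random Fourier features phi_i(x) = sqrt(2) * cos(omega_i . x + b_i), i < t
            return np.sqrt(2.0) * np.cos(omegas[:t] @ x + phases[:t])

        for t in range(T):
            i = rng.integers(n)                                # randomness 1: a data point
            omegas[t] = rng.normal(scale=1.0 / sigma, size=d)  # randomness 2: a random feature
            phases[t] = rng.uniform(0.0, 2.0 * np.pi)
            gamma = step / (1.0 + t)                           # decaying step size
            pred = feats(X[i], t) @ alphas[:t] if t > 0 else 0.0
            alphas[:t] *= 1.0 - gamma * reg                    # shrinkage from the ridge term
            # squared loss l(f, y) = (f - y)^2 / 2, so l'(f, y) = f - y
            alphas[t] = -gamma * (pred - y[i]) * np.sqrt(2.0) * np.cos(omegas[t] @ X[i] + phases[t])

        return lambda x: feats(x, T) @ alphas   # the learned function f(x)

The sketch stores every sampled frequency for clarity; the paper's device of regenerating each random feature on the fly from a stored pseudo-random seed, which is what keeps memory low, is omitted here.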

Citations of this paper

Asynchronous Doubly Stochastic Sparse Kernel Learning
TLDR
Experimental results on various large-scale real-world datasets show that the AsyDSSKL method is significantly more computationally efficient at both training and prediction than existing kernel methods.
Parsimonious Online Learning with Kernels via sparse projections in function space
TLDR
Stochastic nonparametric regression in a reproducing kernel Hilbert space (RKHS) is considered as an extension of expected risk minimization to nonlinear function estimation, and a functional stochastic gradient method is employed to solve it.
Scaling Up Generalized Kernel Methods
Scalable Kernel Ordinal Regression via Doubly Stochastic Gradients
TLDR
A novel DSG-like algorithm, DSGOR, is proposed that achieves an $O(1/t)$ convergence rate, as good as DSG, even though it deals with a much harder problem.
Utilize Old Coordinates: Faster Doubly Stochastic Gradients for Kernel Methods
TLDR
Two algorithms are proposed that remedy the scalability issue of kernel methods by "utilizing" old random features instead of adding new ones in certain iterations; the resulting procedure is surprisingly simple, does not increase the complexity of the original algorithm, and is effective in practice.
Triply Stochastic Gradients on Multiple Kernel Learning
TLDR
The triply Stochastic Gradient Descent (triply SGD) algorithm, a novel extension of doubly SGD to MKL, is developed; it involves three sources of randomness – the data points, the random features, and the kernels – the last of which was not considered in previous work.
Nonparametric Compositional Stochastic Optimization for Risk-Sensitive Kernel Learning
TLDR
This work develops the first memory-efficient stochastic algorithm for this setting, and provides, for the first time, a non-asymptotic tradeoff between the complexity of a function parameterization and its required convergence accuracy for both strongly convex and non-convex objectives under constant step-sizes.
Random Features Methods in Supervised Learning
TLDR
The fast learning rate of random Fourier features for the Gaussian kernel, with the number of features far smaller than the sample size, justifies the computational advantage of random features over kernel methods from the theoretical perspective.
Generalization Properties of Doubly Online Learning Algorithms
How to Scale Up Kernel Methods to Be As Good As Deep Neural Nets
TLDR
This work develops methods to scale up kernel models to successfully tackle large-scale learning problems that have so far been approachable only by deep learning architectures, and conducts extensive empirical studies on problems from image recognition and automatic speech recognition.

References

SHOWING 1-10 OF 48 REFERENCES
Online learning with kernels
TLDR
This paper considers online learning in a reproducing kernel Hilbert space, which allows the kernel trick to be exploited in an online setting, and examines the value of large margins for classification in the online setting with a drifting target.
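
For context, a minimal sketch of functional stochastic gradient descent in an RKHS in the spirit of this line of work, assuming a Gaussian kernel and squared loss; the names are illustrative, and the truncation/budget machinery that keeps the expansion small is omitted.

    import numpy as np

    def online_kernel_regression(stream, sigma=1.0, reg=1e-3, step=0.5):
        """Online functional gradient descent in an RKHS (sketch)."""
        centers, coefs = [], []

        def k(a, b):  # Gaussian kernel
            return np.exp(-np.sum((a - b) ** 2) / (2.0 * sigma ** 2))

        for t, (x, y) in enumerate(stream, start=1):
            gamma = step / np.sqrt(t)
            pred = sum(c * k(x, z) for c, z in zip(coefs, centers))
            coefs = [(1.0 - gamma * reg) * c for c in coefs]  # the ridge term shrinks old terms
            centers.append(x)
            coefs.append(-gamma * (pred - y))                 # squared-loss functional gradient step
        return centers, coefs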
Kernel Conjugate Gradient for Fast Kernel Machines
TLDR
A novel variant of the conjugate gradient algorithm, Kernel Conjugate Gradient (KCG), is proposed to speed up learning for kernel machines with differentiable loss functions, and it is found to consistently and significantly outperform existing techniques.
Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization
TLDR
This paper investigates the optimality of SGD in a stochastic setting and shows that for smooth problems the algorithm attains the optimal O(1/T) rate; for non-smooth problems, however, the convergence rate with averaging might really be Ω(log(T)/T), and this is not just an artifact of the analysis.
Fast and scalable polynomial kernels via explicit feature maps
TLDR
A novel randomized tensor product technique, called Tensor Sketching, is proposed for approximating any polynomial kernel in O(n(d + D log D)) time; it achieves higher accuracy and often runs orders of magnitude faster than the state-of-the-art approach on large-scale real-world datasets.
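
A rough sketch of the Tensor Sketching idea for the homogeneous polynomial kernel (x . y)^p, assuming NumPy; the sketch dimension and names are illustrative.

    import numpy as np

    def tensor_sketch(X, degree=2, D=256, rng=None):
        """CountSketch each input `degree` times, multiply the FFTs elementwise, invert (sketch)."""
        rng = np.random.default_rng(rng)
        n, d = X.shape
        hashes = rng.integers(0, D, size=(degree, d))      # h_i : [d] -> [D]
        signs = rng.choice([-1.0, 1.0], size=(degree, d))  # s_i : [d] -> {+1, -1}

        prod = np.ones((n, D), dtype=complex)
        for i in range(degree):
            cs = np.zeros((n, D))
            # CountSketch: add s_i(k) * x[k] into bucket h_i(k)
            np.add.at(cs, (slice(None), hashes[i]), signs[i] * X)
            prod *= np.fft.fft(cs, axis=1)
        return np.real(np.fft.ifft(prod, axis=1))

For example, with Z = tensor_sketch(X, degree=3, rng=0), the matrix Z @ Z.T approximates (X @ X.T) ** 3 in expectation.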
Efficient additive kernels via explicit feature maps (A. Vedaldi and A. Zisserman, 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition)
TLDR
It is shown that the χ2 kernel, which has been found to yield the best performance in most applications, also has the most compact feature representation, and that the proposed feature maps obtain a significant performance improvement over current state-of-the-art results based on the intersection kernel.
On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning
TLDR
An algorithm is presented that computes an easily interpretable low-rank approximation to an n × n Gram matrix G such that computations of interest may be performed more rapidly.
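
A minimal sketch of the Nystrom idea for a Gaussian Gram matrix, assuming NumPy; the uniform column sampling and names are illustrative.

    import numpy as np

    def nystrom_approx(X, m=100, gamma=0.1, rng=None):
        """Approximate the n x n Gram matrix G by C @ pinv(W) @ C.T from m landmark columns (sketch)."""
        rng = np.random.default_rng(rng)
        n = X.shape[0]
        idx = rng.choice(n, size=m, replace=False)  # landmark columns

        def rbf(A, B):  # Gaussian kernel matrix between row sets A and B
            sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-gamma * sq)

        C = rbf(X, X[idx])   # n x m block of G
        W = C[idx]           # m x m intersection block
        return C, np.linalg.pinv(W)

Keeping the two factors instead of forming C @ pinv(W) @ C.T explicitly reduces storage from O(n^2) to O(nm).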
Random Laplace Feature Maps for Semigroup Kernels on Histograms
TLDR
A new randomized technique called random Laplace features is developed to approximate a family of kernel functions adapted to the semigroup structure of R_+^d, the natural algebraic structure on the set of histograms and other non-negative data representations.
Quasi-Monte Carlo Feature Maps for Shift-Invariant Kernels
TLDR
A new discrepancy measure called box discrepancy is derived from theoretical characterizations of the integration error with respect to a given sequence, and Quasi-Monte Carlo (QMC) feature maps are improved by explicitly minimizing this box discrepancy.
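
A minimal sketch of the basic Quasi-Monte Carlo substitution for Gaussian-kernel random Fourier features, assuming SciPy >= 1.7 for scipy.stats.qmc; the box-discrepancy optimization described above is not shown, and the names are illustrative.

    import numpy as np
    from scipy.stats import norm, qmc

    def qmc_fourier_features(X, D=512, sigma=1.0, seed=0):
        """QMC feature map for the kernel exp(-||x - y||^2 / (2 sigma^2)) (sketch)."""
        n, d = X.shape
        u = qmc.Halton(d=d, scramble=True, seed=seed).random(D)  # low-discrepancy points in [0, 1]^d
        omega = norm.ppf(u) / sigma                              # push through the Gaussian inverse CDF
        proj = X @ omega.T                                       # n x D projections
        return np.hstack([np.cos(proj), np.sin(proj)]) / np.sqrt(D)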
Efficient SVM Training Using Low-Rank Kernel Representations
TLDR
This work shows that for a low-rank kernel matrix it is possible to design a better interior point method (IPM) in terms of storage requirements as well as computational complexity, and derives an upper bound on the change in the objective function value based on the approximation error and the number of active constraints (support vectors).
Accelerating Stochastic Gradient Descent using Predictive Variance Reduction
TLDR
It is proved that this method enjoys the same fast convergence rate as those of stochastic dual coordinate ascent (SDCA) and Stochastic Average Gradient (SAG), but the analysis is significantly simpler and more intuitive.
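
A minimal sketch of the predictive variance reduction idea (SVRG) on least squares, assuming NumPy; the epoch length and step size are illustrative.

    import numpy as np

    def svrg_least_squares(X, y, epochs=20, inner=None, step=0.1, rng=None):
        """SGD with a full-gradient snapshot correction per epoch (sketch)."""
        rng = np.random.default_rng(rng)
        n, d = X.shape
        inner = inner or 2 * n
        w = np.zeros(d)

        def grad(wv, i=None):
            if i is None:
                return X.T @ (X @ wv - y) / n     # full gradient at the snapshot
            return X[i] * (X[i] @ wv - y[i])      # stochastic gradient on one example

        for _ in range(epochs):
            w_snap = w.copy()
            full = grad(w_snap)
            for _ in range(inner):
                i = rng.integers(n)
                # corrected stochastic gradient: its variance shrinks as w and w_snap approach the optimum
                w -= step * (grad(w, i) - grad(w_snap, i) + full)
        return w

The correction term is what allows a constant step size, which is the source of the fast convergence rate mentioned above.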