• Corpus ID: 3814261

Sharp analysis of low-rank kernel matrix approximations

  • Francis R. Bach
  • Published 9 August 2012
  • Computer Science
  • ArXiv
We consider supervised learning problems within the positive-definite kernel framework, such as kernel ridge regression, kernel logistic regression or the support vector machine. With kernels leading to infinite-dimensional feature spaces, a common practical limiting difficulty is the necessity of computing the kernel matrix, which most frequently leads to algorithms with running time at least quadratic in the number of observations n, i.e., O(n^2). Low-rank approximations of the kernel matrix… 
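One common way to build such a low-rank approximation is to sample p columns of the kernel matrix (the Nyström method), so that only an n × p block is ever computed. The sketch below is a minimal illustration of that idea for kernel ridge regression, not the paper's own procedure. The function names, the Gaussian kernel, uniform column sampling, and all parameter values are assumptions chosen for the example.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gaussian kernel matrix exp(-gamma * ||x - y||^2) between rows of X and Y."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_krr_fit(X, y, p, lam, gamma=1.0, seed=0):
    """Kernel ridge regression restricted to p uniformly sampled columns.

    Hypothetical helper: only the n x p block C = K[:, idx] is formed,
    so the cost is O(n p^2) rather than O(n^2).
    """
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    idx = rng.choice(n, size=p, replace=False)   # sampled column indices
    C = rbf_kernel(X, X[idx], gamma)             # n x p block of K
    W = C[idx]                                   # p x p block K[idx, idx]
    # Minimise (1/n) * ||y - C b||^2 + lam * b^T W b over b.
    # Normal equations: (C^T C + n * lam * W) b = C^T y.
    A = C.T @ C + n * lam * W
    # Least-squares solve, because A can be numerically rank-deficient.
    beta = np.linalg.lstsq(A, C.T @ y, rcond=None)[0]
    return X[idx], beta

def nystrom_krr_predict(X_new, landmarks, beta, gamma=1.0):
    """Evaluate the fitted function at the rows of X_new."""
    return rbf_kernel(X_new, landmarks, gamma) @ beta
```

The full n × n kernel matrix never appears: fitting touches only C and W, which is where the saving over quadratic-time algorithms comes from.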

Learning the kernel matrix via predictive low-rank approximations

The Mklaren algorithm approximates multiple kernel matrices and learns a regression model; it is based entirely on geometrical concepts and outperforms contemporary kernel matrix approximation approaches when learning with multiple kernels.

On the Complexity of Learning with Kernels

There are kernel learning problems where no such method will lead to non-trivial computational savings, and lower bounds on the error attainable by such methods as a function of the number of entries observed in the kernel matrix or the rank of an approximate kernel matrix are studied.

Fast Randomized Kernel Ridge Regression with Statistical Guarantees

A version of this approach that comes with running time guarantees as well as improved guarantees on its statistical performance is described, and a fast algorithm is presented to quickly compute coarse approximations to these scores in time linear in the number of samples.

ℓp-norm Multiple Kernel Learning with Low-rank Kernels

Provably Useful Kernel Matrix Approximation in Linear Time

We give the first algorithms for kernel matrix approximation that run in time linear in the number of data points and output an approximation which gives provable guarantees when used in many

Scaling Up Kernel SVM on Limited Resources: A Low-Rank Linearization Approach

This paper proposes a novel approach called low-rank linearized SVM to scale up kernel SVM on limited resources via an approximate empirical kernel map computed from efficient kernel low-rank decompositions.

Randomized Nyström Features for Fast Regression: An Error Analysis

It is empirically shown that using the l randomly selected columns of a kernel matrix for a construction of m-dimensional random feature vectors produces smaller error on a regression problem than using m randomly selected columns.

On expected error of randomized Nyström kernel regression

It is proved that the error of a predictor learned via this method is almost the same in expectation as the error of a kernel predictor, and the randomized SVD method is applied instead of the spectral decomposition to reduce the time complexity.

Fast Randomized Kernel Methods With Statistical Guarantees

A version of this approach that comes with running time guarantees as well as improved guarantees on its statistical performance is described, and a new notion of the statistical leverage of a data point captures in a fine way the difficulty of the original statistical learning problem.

Faster Kernel Ridge Regression Using Sketching and Preconditioning

This paper proposes a preconditioning technique based on random feature maps, such as random Fourier features, which have recently emerged as a powerful technique for speeding up and scaling the training of kernel-based methods by resorting to approximations.



Efficient SVM Training Using Low-Rank Kernel Representations

This work shows that for a low rank kernel matrix it is possible to design a better interior point method (IPM) in terms of storage requirements as well as computational complexity and derives an upper bound on the change in the objective function value based on the approximation error and the number of active constraints (support vectors).

Predictive low-rank decomposition for kernel methods

This paper presents an algorithm that can exploit side information (e.g., classification labels, regression responses) in the computation of low-rank decompositions for kernel matrices and presents simulation results showing that the algorithm yields decompositions of significantly smaller rank than those found by incomplete Cholesky decomposition.

On the Impact of Kernel Approximation on Learning Accuracy

Stability bounds based on the norm of the kernel approximation for these algorithms, including SVMs, KRR, and graph Laplacian-based regularization algorithms, are given to determine the degree of approximation that can be tolerated in the estimation of the kernel matrix.

Improved Bounds for the Nyström Method With Application to Kernel Classification

A kernel classification approach based on the Nyström method is presented and it is shown that when the eigenvalues of the kernel matrix follow a p-power law, the number of support vectors can be reduced to N^{2p/(p^2 - 1)}, which is sublinear in N when p > 1+√2, without seriously sacrificing its generalization performance.
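The threshold p > 1+√2 can be checked directly, assuming the quoted bound is N raised to the power 2p/(p^2 - 1). The count is sublinear in N exactly when that exponent is below 1:

```latex
\[
\frac{2p}{p^{2}-1} < 1
\;\Longleftrightarrow\;
p^{2} - 2p - 1 > 0 \quad (p > 1)
\;\Longleftrightarrow\;
p > 1 + \sqrt{2} \approx 2.414 .
\]
```

For example, p = 3 gives the exponent 6/8 = 0.75, so the number of support vectors would scale as N^{0.75}.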

Compressed Least-Squares Regression

It is shown that solving the problem in the compressed domain instead of the initial domain reduces the estimation error at the price of an increased (but controlled) approximation error.

Optimal Rates for the Regularized Least-Squares Algorithm

A complete minimax analysis of the problem is described, showing that the convergence rates obtained by regularized least-squares estimators are indeed optimal over a suitable class of priors defined by the considered kernel.

Breaking the curse of kernelization: budgeted stochastic gradient descent for large-scale SVM training

Comprehensive empirical results show that BSGD achieves higher accuracy than the state-of-the-art budgeted online algorithms and comparable to non-budget algorithms, while achieving impressive computational efficiency both in time and space during training and prediction.

Random Features for Large-Scale Kernel Machines

Two sets of random features are explored, convergence bounds on their ability to approximate various radial basis kernels are provided, and it is shown that in large-scale classification and regression tasks linear machine learning algorithms applied to these features outperform state-of-the-art large-scale kernel machines.
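A minimal sketch of one such random feature map, the random Fourier (cosine) features for a Gaussian kernel, is given below. It illustrates the general idea rather than reproducing that paper's code. The function name, the kernel parametrisation exp(-gamma * ||x - y||^2), and the defaults are assumptions made for the example.

```python
import numpy as np

def random_fourier_features(X, D, gamma=1.0, seed=0):
    """Map the rows of X (n x d) to D random cosine features.

    Inner products of the returned features approximate the Gaussian
    kernel exp(-gamma * ||x - y||^2) (hypothetical helper).
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # The Fourier transform of exp(-gamma * ||delta||^2) is a Gaussian
    # density with covariance 2 * gamma * I, so frequencies are drawn from it.
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))
    # Random phases, uniform on [0, 2*pi).
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)
```

A linear model trained on these D-dimensional features then stands in for the kernel machine, at a cost that grows linearly in the number of observations.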

A high-dimensional Wilks phenomenon

A theorem by Wilks asserts that in smooth parametric density estimation the difference between the maximum likelihood and the likelihood of the sampling distribution converges toward a Chi-square

Fast Sparse Gaussian Process Methods: The Informative Vector Machine

A framework for sparse Gaussian process (GP) methods is presented; it uses forward selection with criteria based on information-theoretic principles, allows for Bayesian model selection, and is less complex in implementation.