# Learning primal-dual sparse kernel machines

@article{Huusari2021LearningPS, title={Learning primal-dual sparse kernel machines}, author={Riikka Huusari and Sahely Bhadra and C{\'e}cile Capponi and Hachem Kadri and Juho Rousu}, journal={ArXiv}, year={2021}, volume={abs/2108.12199} }

Traditionally, kernel methods rely on the representer theorem, which states that the solution to a learning problem is obtained as a linear combination of the data mapped into the reproducing kernel Hilbert space (RKHS). While elegant from a theoretical point of view, the theorem hinders both the scalability of algorithms to large datasets and the interpretability of the learned function. In this paper, instead of using the traditional representer theorem, we propose to search for a solution in…
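To make the scalability point concrete, here is a minimal kernel ridge regression sketch (an illustrative example, not the authors' proposed method): by the representer theorem the solution has the form f(x) = Σᵢ αᵢ k(xᵢ, x), so training involves the full n × n Gram matrix and every prediction touches all n training points.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))
y = np.sin(X[:, 0])

def rbf(A, B, gamma=1.0):
    # Gaussian (RBF) kernel matrix between the rows of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

lam = 1e-2
K = rbf(X, X)                                          # n x n Gram matrix: O(n^2) memory
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # O(n^3) training cost

x_new = rng.standard_normal((1, 3))
f_new = rbf(x_new, X) @ alpha                          # O(n) cost per prediction
```

Every coefficient αᵢ is tied to a training point, which is exactly the coupling the paper seeks to relax.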

## 44 References

### On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning

- Computer Science, Mathematics · J. Mach. Learn. Res.
- 2005

An algorithm to compute an easily interpretable low-rank approximation to an n × n Gram matrix G so that computations of interest can be performed more rapidly.
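The Nyström idea can be sketched as follows: sample m landmark columns C of the Gram matrix together with the m × m intersection block W, then approximate G ≈ C W⁺ Cᵀ. This is a hedged illustration; the RBF kernel, landmark count, and uniform sampling are illustrative choices, not the paper's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 2))

def rbf(A, B, gamma=0.05):
    # smooth RBF kernel so the Gram matrix has fast spectral decay
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

m = 50
idx = rng.choice(len(X), m, replace=False)  # uniform landmark sampling
C = rbf(X, X[idx])                          # n x m block of the Gram matrix
W = C[idx]                                  # m x m intersection block
G_approx = C @ np.linalg.pinv(W) @ C.T      # Nystrom approximation G ~ C W^+ C^T

G = rbf(X, X)
rel_err = np.linalg.norm(G - G_approx) / np.linalg.norm(G)
```

Only the n × m block C is ever formed during training, which is where the computational savings come from.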

### Large-Scale Sparse Kernel Canonical Correlation Analysis

- Computer Science · ICML
- 2019

GradKCCA corresponds to solving KCCA with the additional constraint that the canonical projection directions in the kernel-induced feature space have preimages in the original data space, and it outperforms state-of-the-art CCA methods in terms of speed and robustness to noise in both simulated and real-world datasets.

### Kernel methods through the roof: handling billions of points efficiently

- Computer Science · NeurIPS
- 2020

This work designed a preconditioned gradient solver for kernel methods exploiting both GPU acceleration and parallelization with multiple GPUs, implementing out-of-core variants of common linear algebra operations to guarantee optimal hardware utilization.

### Efficient projections onto the l1-ball for learning in high dimensions

- Computer Science · ICML '08
- 2008

Efficient algorithms for projecting a vector onto the l1-ball are described, and variants of stochastic gradient projection methods augmented with these efficient projection procedures outperform interior-point methods, which are considered state-of-the-art optimization techniques.
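The core routine can be sketched in a few lines. This is a sketch of the sort-based projection described in the paper (variable names are mine): sort the magnitudes, find the threshold θ, and soft-threshold.

```python
import numpy as np

def project_l1_ball(v, z=1.0):
    """Euclidean projection of v onto the l1-ball {w : ||w||_1 <= z}."""
    if np.abs(v).sum() <= z:
        return v.copy()                      # already inside the ball
    u = np.sort(np.abs(v))[::-1]             # magnitudes in decreasing order
    css = np.cumsum(u)
    # largest index rho with u_rho > (css_rho - z) / rho  (1-indexed)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - z))[0][-1]
    theta = (css[rho] - z) / (rho + 1.0)     # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

w = project_l1_ball(np.array([3.0, 1.0]), z=1.0)
```

The sort dominates the cost, giving O(d log d) per projection; the paper also discusses an expected linear-time variant.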

### FALKON: An Optimal Large Scale Kernel Method

- Computer Science · NIPS
- 2017

This paper proposes FALKON, a novel algorithm that can efficiently process millions of points, derived by combining several algorithmic principles, namely stochastic subsampling, iterative solvers, and preconditioning.

### Multiple Kernel Learning Algorithms

- Computer Science · J. Mach. Learn. Res.
- 2011

Overall, using multiple kernels instead of a single one is useful; combining kernels in a nonlinear or data-dependent way seems more promising than linear combination for fusing information from simple linear kernels, whereas linear methods are more reasonable when combining complex Gaussian kernels.
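As a minimal illustration of the linear-combination setting discussed here (the base kernels and weights are arbitrary choices of mine): a convex combination of positive semi-definite base kernels is again a valid PSD kernel, which is what makes the linear MKL formulation well posed.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 4))

def linear_k(A, B):
    return A @ B.T

def rbf_k(A, B, gamma=0.5):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

# convex combination: weights eta lie on the probability simplex
eta = np.array([0.3, 0.7])
K = eta[0] * linear_k(X, X) + eta[1] * rbf_k(X, X)

# a nonnegative combination of PSD kernels is itself PSD
eigvals = np.linalg.eigvalsh(K)
```

In MKL proper the weights η are learned jointly with the predictor rather than fixed as here.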

### Sparse kernel SVMs via cutting-plane training

- Computer Science · Machine Learning
- 2009

An algorithm for training kernel SVMs that can represent the learned rule using arbitrary basis vectors, not just the support vectors from the training set, is explored; it has the potential to make training of kernel SVMs tractable for large training sets, where conventional methods scale quadratically due to the linear growth of the number of support vectors.

### A Generalized Representer Theorem

- Mathematics · COLT/EuroCOLT
- 2001

The result shows that a wide range of problems have optimal solutions that live in the finite dimensional span of the training examples mapped into feature space, thus enabling us to carry out kernel algorithms independent of the (potentially infinite) dimensionality of the feature space.
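In the notation commonly used for this result (symbols mine), the theorem says that minimizing a regularized empirical risk over the whole RKHS yields a solution in the span of the mapped training data:

```latex
\min_{f \in \mathcal{H}} \; c\big((x_1, y_1, f(x_1)), \ldots, (x_n, y_n, f(x_n))\big) + \Omega\big(\|f\|_{\mathcal{H}}\big)
\quad \Longrightarrow \quad
f^{*}(\cdot) = \sum_{i=1}^{n} \alpha_i \, k(x_i, \cdot),
```

where $c$ is an arbitrary loss and $\Omega$ is a strictly monotonically increasing regularizer. It is precisely this expansion over all $n$ training points that the main paper's primal-dual formulation relaxes.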

### Reducing the Number of Support Vectors of SVM Classifiers Using the Smoothed Separable Case Approximation

- Computer Science · IEEE Transactions on Neural Networks and Learning Systems
- 2012

An algorithm is proposed, called the smoothed SCA (SSCA), that additionally upper-bounds the weight vector of the pruned solution and, for the commonly used kernels, reduces the number of support vectors even more.

### On the Relationship between Generalization Error, Hypothesis Complexity, and Sample Complexity for Radial Basis Functions

- Computer Science · Neural Computation
- 1996

This article shows that the generalization error can be decomposed into two terms: the approximation error, due to the insufficient representational capacity of a finite-sized network, and the estimation error, due to insufficient information about the target function because of the finite number of samples.
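Schematically (notation mine, with $n$ the network size and $N$ the sample size), the decomposition reads:

```latex
\underbrace{\mathbb{E}\big[(f_0 - \hat f_{n,N})^2\big]}_{\text{generalization error}}
\;\lesssim\;
\underbrace{\varepsilon_{\mathrm{approx}}(n)}_{\text{finite network size}}
\;+\;
\underbrace{\varepsilon_{\mathrm{estim}}(n, N)}_{\text{finite sample size}},
```

so that the two sources of error can be traded off by choosing $n$ as a function of $N$.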