Corpus ID: 246431272

Preconditioning for Scalable Gaussian Process Hyperparameter Optimization

Jonathan Wenger, Geoff Pleiss, Philipp Hennig, John P. Cunningham, Jacob R. Gardner
Gaussian process hyperparameter optimization requires linear solves with, and log-determinants of, large kernel matrices. Iterative numerical techniques are becoming popular for scaling to larger datasets, relying on the conjugate gradient method (CG) for the linear solves and on stochastic trace estimation for the log-determinant. This work introduces new algorithmic and theoretical insights for preconditioning these computations. While preconditioning is well understood in the context of CG, we… 
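The linear-solve side of the abstract above can be illustrated with a minimal sketch: solving (K + σ²I)x = y by preconditioned CG. This is not the paper's preconditioner; a simple Jacobi (diagonal) preconditioner stands in for the more sophisticated ones the paper develops, and the RBF kernel, lengthscale, and noise level are illustrative choices.

```python
import numpy as np
from scipy.sparse.linalg import cg, LinearOperator

# Illustrative sketch (not the paper's algorithm): solve (K + sigma^2 I) x = y
# with conjugate gradients, using a Jacobi (diagonal) preconditioner as a
# stand-in for the preconditioners discussed in the paper.
rng = np.random.default_rng(0)
n = 200
X = rng.standard_normal((n, 2))

# RBF kernel matrix with unit lengthscale, plus a small noise term
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K_hat = np.exp(-0.5 * sq_dists) + 1e-2 * np.eye(n)

y = rng.standard_normal(n)

# Jacobi preconditioner: apply diag(K_hat)^{-1} as a linear operator
inv_diag = 1.0 / np.diag(K_hat)
M = LinearOperator((n, n), matvec=lambda v: inv_diag * v)

x, info = cg(K_hat, y, M=M, maxiter=500)
residual = np.linalg.norm(K_hat @ x - y)  # should be small if info == 0
```

In GP hyperparameter optimization the same solve appears inside the marginal-likelihood gradient, which is why a good preconditioner pays off at every optimization step.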


Posterior and Computational Uncertainty in Gaussian Processes
A new class of methods is developed that provides consistent estimation of the combined uncertainty arising from both the finite number of data observed and the finite amount of computation expended, and the consequences of ignoring computational uncertainty are demonstrated.


Preconditioning Kernel Matrices
A scalable approach to both solving kernel machines and learning their hyperparameters is described, and it is shown this approach is exact in the limit of iterations and outperforms state-of-the-art approximations for a given computational budget.
Scalable Log Determinants for Gaussian Process Kernel Learning
It is found that Lanczos is generally superior to Chebyshev for kernel learning, and that a surrogate approach can be highly efficient and accurate with popular kernels.
Bias-Free Scalable Gaussian Processes via Randomized Truncations
This paper analyzes two common techniques: early truncated conjugate gradients (CG) and random Fourier features (RFF) and finds that both methods introduce a systematic bias on the learned hyperparameters: CG tends to underfit while RFF tends to overfit.
On the Use of Discrete Laplace Operator for Preconditioning Kernel Matrices
Jing Chen, SIAM J. Sci. Comput., 2013
The proposed preconditioning technique also applies to non-Toeplitz matrices, eliminating the reliance on a regular grid structure of the points, and equal-distribution results on the spectrum of the resulting matrices are proved.
Exact Gaussian Processes on a Million Data Points
A scalable approach for exact GPs is developed that leverages multi-GPU parallelization and methods like linear conjugate gradients, accessing the kernel matrix only through matrix multiplication, and is generally applicable, without constraints to grid data or specific kernel classes.
On Randomized Trace Estimates for Indefinite Matrices with an Application to Determinants
New tail bounds for randomized trace estimates applied to indefinite B with Rademacher or Gaussian random vectors are derived, which significantly improve existing results for indefinite B, reducing the number of required samples by a factor n or even more.
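The randomized trace estimation analyzed above can be sketched in a few lines: the Hutchinson estimator averages quadratic forms z^T B z over random probe vectors, which is unbiased for tr(B) when the entries of z are i.i.d. Rademacher (±1). The matrix, seed, and probe count below are illustrative.

```python
import numpy as np

# Hutchinson-style stochastic trace estimation with Rademacher probes:
# E[z^T B z] = tr(B) when the entries of z are +/-1 with equal probability.
# A minimal sketch of the technique whose tail bounds the paper analyzes;
# the test matrix here is symmetric and indefinite, matching that setting.
rng = np.random.default_rng(1)
n = 300
A = rng.standard_normal((n, n))
B = A + A.T  # symmetric, indefinite

num_probes = 2000
samples = []
for _ in range(num_probes):
    z = rng.choice([-1.0, 1.0], size=n)
    samples.append(z @ (B @ z))
trace_estimate = np.mean(samples)

exact = np.trace(B)
abs_error = abs(trace_estimate - exact)
```

The estimator's variance scales with the squared off-diagonal mass of B, which is why the number of probes needed for indefinite matrices is the quantity the tail bounds above sharpen.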
Optimal Rates for Random Fourier Features
A detailed finite-sample theoretical analysis of the approximation quality of RFFs is provided by establishing optimal (in terms of the RFF dimension and growing set size) performance guarantees in uniform norm, and presenting guarantees in L^r (1 ≤ r < ∞) norms.
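The RFF approximation whose rates are analyzed above can be sketched directly: for the RBF kernel with unit lengthscale, frequencies drawn from a standard Gaussian and random phases give a feature map whose inner product approximates the kernel. The dimensions and test points below are illustrative.

```python
import numpy as np

# Random Fourier features (Rahimi & Recht) for the RBF kernel
# k(x, y) = exp(-||x - y||^2 / 2): a minimal sketch of the approximation
# whose finite-sample rates the paper above analyzes.
rng = np.random.default_rng(2)
d, D = 3, 5000  # input dimension, number of random features

W = rng.standard_normal((D, d))        # frequencies ~ N(0, I), unit lengthscale
b = rng.uniform(0, 2 * np.pi, size=D)  # random phases

def phi(x):
    """Feature map such that phi(x) @ phi(y) ≈ k(x, y)."""
    return np.sqrt(2.0 / D) * np.cos(W @ x + b)

x = np.array([0.3, -0.1, 0.5])
y = np.array([0.0, 0.2, 0.4])
approx = phi(x) @ phi(y)
exact = np.exp(-0.5 * np.sum((x - y) ** 2))
```

The pointwise error decays like O(1/√D); the contribution of the paper above is controlling this error uniformly over a set of inputs, not just at fixed pairs.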
Thoughts on Massively Scalable Gaussian Processes
The MSGP framework enables the use of Gaussian processes on billions of datapoints, without requiring distributed inference or severe assumptions, and reduces the standard GP learning and inference complexity to O(n), and the standard test point prediction complexity to O(1).
Efficient High Dimensional Bayesian Optimization with Additivity and Quadrature Fourier Features
An efficient and provably no-regret Bayesian optimization algorithm for optimization of black-box functions in high dimensions and introduces a novel deterministic Fourier Features approximation based on numerical integration with detailed analysis for the squared exponential kernel.
On the Nyström Method for Approximating a Gram Matrix for Improved Kernel-Based Learning
An algorithm to compute an easily interpretable low-rank approximation to an n × n Gram matrix G such that computations of interest may be performed more rapidly.
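The Nyström construction described above can be sketched as follows: sample m landmark columns C of the Gram matrix, take the m × m intersection block W, and approximate G ≈ C W⁺ Cᵀ. Uniform landmark sampling, the RBF kernel, and the sizes below are illustrative choices, not the sampling scheme analyzed in the paper.

```python
import numpy as np

# Nyström low-rank approximation of an RBF Gram matrix: sample m landmark
# columns, then approximate G ≈ C @ pinv(W) @ C.T. A minimal sketch with
# uniform landmark sampling (the paper analyzes more refined schemes).
rng = np.random.default_rng(3)
n, m = 500, 80
X = rng.standard_normal((n, 2))

sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
G = np.exp(-0.5 * sq)  # full n x n Gram matrix (built here only for checking)

idx = rng.choice(n, size=m, replace=False)  # landmark indices
C = G[:, idx]            # n x m sampled columns
W = G[np.ix_(idx, idx)]  # m x m intersection block

# Truncated pseudo-inverse guards against near-singular W
G_nystrom = C @ np.linalg.pinv(W, rcond=1e-10) @ C.T
rel_err = np.linalg.norm(G - G_nystrom) / np.linalg.norm(G)
```

Because the RBF kernel's spectrum decays quickly, a small number of landmarks captures most of the Gram matrix, and downstream solves cost O(nm²) instead of O(n³).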