A statistical perspective of sampling scores for linear regression

@inproceedings{Chen2016,
  title={A statistical perspective of sampling scores for linear regression},
  author={Siheng Chen and R. Varma and Aarti Singh and J. Kovacevic},
  booktitle={2016 IEEE International Symposium on Information Theory (ISIT)},
  year={2016}
}
In this paper, we consider the statistical problem of learning a linear model from noisy samples. Existing work has focused on approximating the least squares solution by using leverage-based scores as an importance sampling distribution. However, no finite-sample statistical guarantees and no computationally efficient optimal sampling strategies have been proposed. To evaluate the statistical properties of different sampling strategies, we propose a simple yet effective estimator, which is easy…
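The leverage-based importance sampling that the abstract refers to can be sketched as follows. This is a minimal illustration, not the paper's estimator: the problem sizes, the with-replacement sampling, and the 1/(m·p_i) reweighting are my assumptions about the standard scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression problem: n noisy samples of a linear model.
n, d = 1000, 5
X = rng.standard_normal((n, d))
beta = rng.standard_normal(d)
y = X @ beta + 0.1 * rng.standard_normal(n)

# Leverage scores are the squared row norms of U, where X = U S V^T.
U, _, _ = np.linalg.svd(X, full_matrices=False)
lev = np.sum(U**2, axis=1)      # nonnegative, sums to d
probs = lev / lev.sum()         # importance sampling distribution

# Draw m rows with replacement; rescale each sampled row by
# 1/sqrt(m * p_i) so the subsampled objective is unbiased for the full one.
m = 200
idx = rng.choice(n, size=m, p=probs)
w = 1.0 / np.sqrt(m * probs[idx])
beta_hat, *_ = np.linalg.lstsq(w[:, None] * X[idx], w * y[idx], rcond=None)
```

With only 200 of the 1000 rows, the reweighted subsample estimate stays close to the full least squares solution; uniform sampling (`probs = np.full(n, 1/n)`) drops into the same code path for comparison.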
Asymptotic Analysis of Sampling Estimators for Randomized Numerical Linear Algebra Algorithms
An asymptotic analysis is developed to derive the distribution of RandNLA sampling estimators for the least-squares problem and to characterize the role of leverage in the sampling process; empirical results demonstrate improvements over existing methods.
An econometric perspective on algorithmic subsampling
This paper reviews a line of work grounded in theoretical computer science and numerical linear algebra, and finds that an algorithmically desirable sketch, which is a randomly chosen subset of the data, must preserve the eigenstructure of the data, a property known as a subspace embedding.
An Econometric View of Algorithmic Subsampling
This paper reviews a line of work grounded in theoretical computer science and numerical linear algebra, finds that an algorithmically desirable sketch of the data must have a subspace embedding property, and studies how prediction and inference are affected by data sketching within a linear regression setup.
Determinantal Point Processes for Coresets
It is shown that the coreset property holds for samples formed with determinantal point processes (DPP), a rare example of repulsive point processes with tractable theoretical properties, which enables the construction of general coreset theorems.
How to reduce dimension with PCA and random projections
In our "big data" age, the size and complexity of data is steadily increasing. Methods for dimension reduction are ever more popular and useful. Two distinct types of dimension reduction are…
A Selective Review on Statistical Techniques for Big Data
To meet the big data challenges, many new statistical tools have been developed in recent years. In this review, we summarize some of these approaches to give an overview of the current state of the…
Random sampling of bandlimited signals on graphs
Fourier Sparse Leverage Scores and Approximate Kernel Learning
New explicit upper bounds on the leverage scores of Fourier-sparse functions under both the Gaussian and Laplace measures are proved, generalizing existing work that applies only to uniformly distributed data.


A statistical perspective on algorithmic leveraging
This work provides an effective framework to evaluate the statistical properties of algorithmic leveraging in the context of estimating parameters in a linear regression model, and shows that, from the statistical perspective of bias and variance, neither leverage-based sampling nor uniform sampling dominates the other.
Optimal Subsampling Approaches for Large Sample Linear Regression
A significant hurdle for analyzing large sample data is the lack of effective statistical computing and inference methods. An emerging powerful approach for analyzing large sample data is…
An Explicit Sampling Dependent Spectral Error Bound for Column Subset Selection
By solving a constrained optimization problem related to the error bound with an efficient bisection search, this paper achieves better performance than using either the leverage-based distribution or the distribution proportional to the square root of the statistical leverage scores.
New Subsampling Algorithms for Fast Least Squares Regression
This work proposes three methods that address the big-data problem by subsampling the covariance matrix, using either single- or two-stage estimation of ordinary least squares from large amounts of data, with an error bound of O(√p/n).
An empirical comparison of sampling techniques for matrix column subset selection
  • Yining Wang, Aarti Singh
  • Mathematics, Computer Science
  • 2015 53rd Annual Allerton Conference on Communication, Control, and Computing (Allerton)
  • 2015
This paper revisits iterative norm sampling, another sampling-based CSS algorithm proposed even before leverage score sampling, and demonstrates its competitive performance under a wide range of experimental settings.
Fast approximation of matrix coherence and statistical leverage
A randomized algorithm is proposed that takes as input an arbitrary n × d matrix A, with n ≫ d, and returns relative-error approximations to all n statistical leverage scores.
Random Projections for the Nonnegative Least-Squares Problem
This work presents a fast random-projection-type approximation algorithm for the Nonnegative Least Squares (NNLS) problem. It employs a randomized Hadamard transform to construct a much smaller problem, solves that smaller problem with a standard NNLS solver, and proves that, with high probability, the resulting nonnegative solution vector is close to the optimum nonnegative solution in a relative-error sense.
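The sketch-and-solve idea behind this line of work can be illustrated in a few lines. This is a hedged sketch, not the paper's algorithm: it uses a dense Gaussian projection in place of the randomized Hadamard transform, and an unconstrained least-squares solve in place of an NNLS solver; the problem sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

# Overdetermined least-squares problem with n >> d.
n, d = 2000, 4
A = rng.standard_normal((n, d))
b = A @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.05 * rng.standard_normal(n)

# Sketch-and-solve: compress the n rows down to k rows with a random
# projection, then solve the small k x d problem instead of the full one.
k = 200
S = rng.standard_normal((k, n)) / np.sqrt(k)   # Gaussian sketching matrix
x_sketch, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)

# Full solution, for comparison only.
x_full, *_ = np.linalg.lstsq(A, b, rcond=None)
```

The small problem has k = 200 rows instead of n = 2000, yet its solution is close to the full one; a subsampled randomized Hadamard transform achieves the same effect in O(n log n) time rather than the O(nk) cost of a dense Gaussian sketch.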
Signal recovery on graphs: Random versus experimentally designed sampling
A new class of smooth graph signals, called approximately bandlimited, is proposed for signal recovery on graphs based on random sampling and experimentally designed sampling; the recovery uses sampling scores similar to the leverage scores used in matrix approximation.
Randomized Algorithms for Matrices and Data
This monograph will provide a detailed overview of recent work on the theory of randomized matrix algorithms as well as the application of those ideas to the solution of practical problems in large-scale data analysis.
oASIS: Adaptive Column Sampling for Kernel Matrix Approximation
A new adaptive sampling algorithm, Accelerated Sequential Incoherence Selection (oASIS), samples columns without explicitly computing the entire kernel matrix, enabling the solution of large problems that are simply intractable using other adaptive methods.