Optimal Subsampling for Large Sample Logistic Regression

@article{WangZhuMa,
  title={Optimal Subsampling for Large Sample Logistic Regression},
  author={Haiying Wang and Rong Zhu and Ping Ma},
  journal={Journal of the American Statistical Association},
  pages={829--844}
}
ABSTRACT For massive data, subsampling algorithms are a popular way to downsize the data volume and reduce the computational burden. Existing studies focus on approximating the ordinary least-squares estimate in linear regression, where statistical leverage scores are often used to define subsampling probabilities. In this article, we propose fast subsampling algorithms to efficiently approximate the maximum likelihood estimate in logistic regression. We first establish consistency and…
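The two-step idea in the abstract (uniform pilot fit, then subsampling with data-dependent probabilities and an inverse-probability-weighted refit) can be sketched as follows. This is a minimal numpy illustration, not the paper's code; the L-optimality-style probabilities proportional to |y_i − p̂_i|·‖x_i‖ and all function names and defaults are our own assumptions.

```python
import numpy as np

def fit_logistic(X, y, w=None, iters=50):
    """Weighted logistic MLE via Newton's method."""
    n, d = X.shape
    w = np.ones(n) if w is None else w
    beta = np.zeros(d)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        g = X.T @ (w * (y - p))                      # weighted score
        H = (X * (w * p * (1 - p))[:, None]).T @ X   # weighted information
        step = np.linalg.solve(H, g)
        beta += step
        if np.max(np.abs(step)) < 1e-8:
            break
    return beta

def osmac_subsample(X, y, r0=200, r=1000, rng=None):
    """Two-step subsampling: uniform pilot of size r0, then r draws with
    probability proportional to |y_i - p_i| * ||x_i||, bias-corrected by
    inverse-probability weights."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    pilot = rng.choice(n, size=r0, replace=False)
    beta0 = fit_logistic(X[pilot], y[pilot])
    p = 1.0 / (1.0 + np.exp(-X @ beta0))
    pi = np.abs(y - p) * np.linalg.norm(X, axis=1)
    pi = pi / pi.sum()
    idx = rng.choice(n, size=r, replace=True, p=pi)
    return fit_logistic(X[idx], y[idx], w=1.0 / (n * pi[idx]))
```

Points the pilot model already predicts confidently contribute little to the likelihood and are rarely drawn, which is what makes the nonuniform scheme more efficient than uniform subsampling at the same budget.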

Optimal Distributed Subsampling for Maximum Quasi-Likelihood Estimators With Massive Data

This article derives optimal Poisson subsampling probabilities in the context of quasi-likelihood estimation under the A- and L-optimality criteria, and establishes the consistency and asymptotic normality of the resultant estimators.
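Poisson subsampling, as opposed to sampling with replacement, decides independently for each point whether to keep it. A minimal sketch of that mechanism, with Horvitz-Thompson-style inverse-probability weights; the function name and interface are ours, not the paper's:

```python
import numpy as np

def poisson_subsample(n_total, pi, r, rng=None):
    """Include point i independently with probability min(1, r * pi_i),
    so the expected subsample size is about r; return the selected
    indices and their inverse-inclusion-probability weights."""
    rng = np.random.default_rng(rng)
    incl = np.minimum(1.0, r * pi)        # inclusion probabilities
    keep = rng.random(n_total) < incl
    idx = np.flatnonzero(keep)
    return idx, 1.0 / incl[idx]
```

Because each decision is independent, this scheme needs only one pass over the data and parallelizes naturally across machines, which is why it suits the distributed setting described above.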

Optimal subsampling for multiplicative regression with massive data

An efficient subsampling method is proposed for the large-scale multiplicative regression model, which can substantially reduce the computational burden caused by massive data.


To rapidly approximate maximum likelihood estimators with massive data, this paper studies the Optimal Subsampling Method under the A-optimality Criterion (OSMAC) for generalized linear models.

Optimal subsampling for quantile regression in big data

We investigate optimal subsampling for quantile regression. We derive the asymptotic distribution of a general subsampling estimator and then derive two versions of optimal subsampling.

Approximating Partial Likelihood Estimators via Optimal Subsampling

A fast and stable subsampling method to approximate the full data maximum partial likelihood estimator in Cox’s model, which reduces the computational burden when analyzing massive survival data.

Optimal Subsampling for Large Sample Ridge Regression

This paper develops an efficient subsampling procedure for the large sample linear ridge regression and proposes to minimize the asymptotic-mean-squared-error criterion for optimality.

Optimal Poisson Subsampling for Softmax Regression

The asymptotic properties of the general Poisson subsampling estimator are derived, and the optimal subsampling probabilities are obtained by minimizing the asymptotic variance-covariance matrix under both the A- and L-optimality criteria.

Optimal subsampling for composite quantile regression in big data

This work establishes the consistency and asymptotic normality of the CQR estimator from a general subsampling algorithm and derives the optimal subsampling probabilities under the L- and A-optimality criteria.

Sampling-based Gaussian Mixture Regression for Big Data

This paper proposes a nonuniform subsampling method for finite mixtures of regression models to reduce the computational burden of large data, evaluates the proposed method in a simulation study, and presents a real data example.

This work proposes a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme, and shows that this method can substantially outperform standard case-control subsampling.
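The accept-reject step described above can be sketched in a few lines of numpy: given a pilot model p̃, each point is kept with probability |y_i − p̃(x_i)|, so points the pilot already classifies confidently and correctly are rarely retained. This is an illustrative sketch under our own naming; the paper's full estimator also applies a post-hoc coefficient adjustment, whereas here one would simply refit with the returned inverse-probability weights.

```python
import numpy as np

def lcc_accept(X, y, pilot_beta, rng=None):
    """Accept-reject subsampling step: keep point i with probability
    |y_i - ptilde(x_i)| under a pilot logistic model, locally balancing
    classes in feature space."""
    rng = np.random.default_rng(rng)
    ptilde = 1.0 / (1.0 + np.exp(-X @ pilot_beta))
    a = np.abs(y - ptilde)            # per-point acceptance probability
    keep = rng.random(len(y)) < a
    return np.flatnonzero(keep), a[keep]
```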

New Subsampling Algorithms for Fast Least Squares Regression

This work proposes three methods that address the big data problem by subsampling the covariance matrix, using either single- or two-stage estimation of ordinary least squares from large amounts of data, with an error bound of O(√p/n).

Fast and Robust Least Squares Estimation in Corrupted Linear Models

Under a general model of corrupted observations, the concept of influence can be used to detect such corrupted observations, and the proposed subsampling algorithm improves over current state-of-the-art approximation schemes for ordinary least squares.

A statistical perspective on algorithmic leveraging

This work provides an effective framework to evaluate the statistical properties of algorithmic leveraging in the context of estimating parameters in a linear regression model and shows that from the statistical perspective of bias and variance, neither leverage-based sampling nor uniform sampling dominates the other.
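Leverage-based subsampling for linear regression, as studied in this line of work, can be sketched directly: compute the statistical leverages from a thin QR factorization, sample rows proportionally, and solve a reweighted least-squares problem. A minimal numpy illustration under our own naming:

```python
import numpy as np

def leverage_sample_ols(X, y, r, rng=None):
    """Algorithmic leveraging for OLS: sample r rows with probability
    proportional to their leverage scores h_i, then solve the
    inverse-probability-reweighted least-squares problem."""
    rng = np.random.default_rng(rng)
    Q, _ = np.linalg.qr(X)             # thin QR: h_i is the i-th row norm of Q, squared
    h = np.sum(Q**2, axis=1)
    pi = h / h.sum()                   # h.sum() equals the column rank of X
    idx = rng.choice(X.shape[0], size=r, replace=True, p=pi)
    w = 1.0 / np.sqrt(pi[idx])         # sqrt weights inside the LS objective
    coef, *_ = np.linalg.lstsq(X[idx] * w[:, None], y[idx] * w, rcond=None)
    return coef
```

The bias-variance finding quoted above applies here: replacing `pi` with a uniform distribution gives the competing uniform estimator, and neither choice dominates on all designs.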

Faster least squares approximation

This work presents two randomized algorithms that provide accurate relative-error approximations to the optimal value and the solution vector of a least squares approximation problem more rapidly than existing exact algorithms.
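The core idea of randomized least-squares approximation is to compress the tall problem with a random sketch and solve the small problem exactly. The sketch below uses a dense Gaussian matrix purely for clarity; the algorithms in this line of work use structured transforms (e.g., a subsampled randomized Hadamard transform) precisely because a dense Gaussian sketch is not faster than the exact solve. Names and parameters are ours.

```python
import numpy as np

def sketched_lstsq(A, b, m, rng=None):
    """Randomized least squares: compress the n-row problem to m rows
    with a Gaussian sketch S, then solve min ||S A x - S b|| exactly."""
    rng = np.random.default_rng(rng)
    S = rng.standard_normal((m, A.shape[0])) / np.sqrt(m)
    x, *_ = np.linalg.lstsq(S @ A, S @ b, rcond=None)
    return x
```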

Leveraging for big data regression

Leveraging methods are a unique development of their type in big data analytics, allowing pervasive access to massive amounts of information without resorting to high-performance computing or cloud computing.

Fast approximation of matrix coherence and statistical leverage

A randomized algorithm is proposed that takes as input an arbitrary n × d matrix A, with n ≫ d, and returns, as output, relative-error approximations to all n of the statistical leverage scores.

A fast randomized algorithm for overdetermined linear least-squares regression

  • V. Rokhlin, M. Tygert
  • Mathematics, Computer Science
    Proceedings of the National Academy of Sciences
  • 2008
A randomized algorithm for overdetermined linear least-squares regression based on QR-decompositions or bidiagonalization that computes an n × 1 vector x such that x minimizes the Euclidean norm ‖Ax − b‖ to relative precision ε.

Low-Rank Approximation and Regression in Input Sparsity Time

We design a new distribution over m × n matrices S so that, for any fixed n × d matrix A of rank r, with probability at least 9/10, ∥SAx∥2 = (1 ± ε)∥Ax∥2 simultaneously for all x ∈ R^d. Here, m is…
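A distribution with this norm-preserving property and input-sparsity application time is commonly realized by a CountSketch-style matrix: each row of A is hashed to one of m buckets and added with a random sign, so S·A costs time proportional to the number of nonzeros of A. A minimal dense-array sketch, illustrative only:

```python
import numpy as np

def countsketch(A, m, rng=None):
    """CountSketch: hash each of the n rows of A to one of m buckets and
    accumulate it with a random sign; equivalent to S @ A for a sparse
    sign matrix S with exactly one nonzero per column."""
    rng = np.random.default_rng(rng)
    n = A.shape[0]
    bucket = rng.integers(0, m, size=n)
    sign = rng.choice([-1.0, 1.0], size=n)
    SA = np.zeros((m, A.shape[1]))
    np.add.at(SA, bucket, sign[:, None] * A)   # unbuffered scatter-add
    return SA
```

With m large enough relative to d, ∥(SA)x∥2 concentrates around ∥Ax∥2 for all x at once, which is the subspace-embedding guarantee the abstract describes.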

CUR matrix decompositions for improved data analysis

An algorithm is presented that preferentially chooses columns and rows that exhibit high “statistical leverage” and exert a disproportionately large “influence” on the best low-rank fit of the data matrix, obtaining improved relative-error and constant-factor approximation guarantees in worst-case analysis, as opposed to the much coarser additive-error guarantees of prior work.