# Optimal Subsampling for Large Sample Logistic Regression

```bibtex
@article{Wang2017OptimalSF,
  title   = {Optimal Subsampling for Large Sample Logistic Regression},
  author  = {Haiying Wang and Rong Zhu and Ping Ma},
  journal = {Journal of the American Statistical Association},
  year    = {2017},
  volume  = {113},
  pages   = {829--844}
}
```

ABSTRACT For massive data, subsampling algorithms are popular for downsizing the data volume and reducing the computational burden. Existing studies focus on approximating the ordinary least-squares estimate in linear regression, where statistical leverage scores are often used to define subsampling probabilities. In this article, we propose fast subsampling algorithms to efficiently approximate the maximum likelihood estimate in logistic regression. We first establish consistency and…
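The two-step idea described in the abstract (fit a pilot on a uniform subsample, then refit with inverse-probability weights on a nonuniform subsample) can be sketched in pure Python. Everything below is an illustrative assumption rather than the paper's exact algorithm: the gradient-ascent solver stands in for a proper Newton solver, the data are synthetic, and the probabilities pi_i proportional to |y_i - p_i| * ||x_i|| are an L-optimality-flavoured choice (the A-optimality version also involves the inverse observed information matrix).

```python
import math
import random

def sigmoid(z):
    z = max(min(z, 30.0), -30.0)   # clamp to avoid overflow
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, w=None, lr=2.0, iters=200):
    """(Weighted) logistic regression by plain gradient ascent on the
    weighted log-likelihood; an illustrative stand-in for a Newton solver."""
    if w is None:
        w = [1.0] * len(X)
    d, total_w = len(X[0]), sum(w)
    beta = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        for xi, yi, wi in zip(X, y, w):
            p = sigmoid(sum(b * x for b, x in zip(beta, xi)))
            for j in range(d):
                grad[j] += wi * (yi - p) * xi[j]
        beta = [b + lr * g / total_w for b, g in zip(beta, grad)]
    return beta

# Synthetic full data (illustrative only): intercept plus one covariate.
random.seed(0)
n = 10000
X = [[1.0, random.gauss(0.0, 1.0)] for _ in range(n)]
true_beta = [0.5, -1.0]
y = [1 if random.random() < sigmoid(sum(b * x for b, x in zip(true_beta, xi))) else 0
     for xi in X]

# Step 1: pilot estimate from a small uniform subsample.
idx0 = random.sample(range(n), 500)
beta0 = fit_logistic([X[i] for i in idx0], [y[i] for i in idx0])

# Step 2: nonuniform probabilities pi_i proportional to |y_i - p_i| * ||x_i||.
scores = []
for xi, yi in zip(X, y):
    p = sigmoid(sum(b * x for b, x in zip(beta0, xi)))
    scores.append(abs(yi - p) * math.sqrt(sum(x * x for x in xi)))
total = sum(scores)
pi = [s / total for s in scores]

# Resample with these probabilities and refit with inverse-probability weights.
r = 1000
idx = random.choices(range(n), weights=pi, k=r)
beta_hat = fit_logistic([X[i] for i in idx], [y[i] for i in idx],
                        w=[1.0 / pi[i] for i in idx])
```

Because the sampling probabilities concentrate on informative points (large residual, large covariate norm), the weighted subsample estimator typically tracks the full-data MLE more closely than a uniform subsample of the same size.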

## 148 Citations

### Optimal Distributed Subsampling for Maximum Quasi-Likelihood Estimators With Massive Data

- Mathematics, Computer Science
- Journal of the American Statistical Association
- 2020

This article derives optimal Poisson subsampling probabilities in the context of quasi-likelihood estimation under the A- and L-optimality criteria, and establishes the consistency and asymptotic normality of the resultant estimators.
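Poisson subsampling, as used in work like this, draws each observation independently rather than sampling with replacement. A minimal sketch (the uniform probabilities and all constants are purely illustrative):

```python
import random

# Poisson subsampling sketch: observation i is kept independently with
# probability min(1, r * pi_i), so the realized subsample size is random
# and concentrates around the target expected size r.
random.seed(2)
n, r = 10000, 500
pi = [1.0 / n] * n   # uniform probabilities, purely for illustration
keep = [i for i in range(n) if random.random() < min(1.0, r * pi[i])]
# len(keep) is Binomial(n, r/n) here, centred near r = 500
```

In a distributed setting this is convenient because each data block can decide inclusion locally, with no coordination needed across machines.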

### Optimal subsampling for multiplicative regression with massive data

- Mathematics, Computer Science
- Statistica Neerlandica
- 2022

An efficient subsampling method for the large-scale multiplicative regression model, which can largely reduce the computational burden due to massive data.

### OPTIMAL SUBSAMPLING ALGORITHMS FOR BIG DATA REGRESSIONS

- Mathematics, Computer Science
- Statistica Sinica
- 2021

To quickly approximate maximum likelihood estimators with massive data, this paper studies the Optimal Subsampling Method under the A-optimality Criterion (OSMAC) for generalized linear models. The…

### Optimal subsampling for quantile regression in big data

- Mathematics
- Biometrika
- 2020

We investigate optimal subsampling for quantile regression. We derive the asymptotic distribution of a general subsampling estimator and then derive two versions of optimal subsampling…

### Approximating Partial Likelihood Estimators via Optimal Subsampling

- Mathematics, Computer Science
- 2022

A fast and stable subsampling method to approximate the full data maximum partial likelihood estimator in Cox’s model, which reduces the computational burden when analyzing massive survival data.

### Optimal Subsampling for Large Sample Ridge Regression

- Computer Science, Mathematics
- 2022

This paper develops an efficient subsampling procedure for large-sample linear ridge regression and proposes minimizing the asymptotic mean squared error criterion for optimality.

### Optimal Poisson Subsampling for Softmax Regression

- Computer Science, Mathematics
- 2021

The asymptotic properties of the general Poisson subsampling estimator are derived, and the optimal subsampling probabilities are obtained by minimizing the asymptotic variance-covariance matrix under both A- and L-optimality criteria.

### Optimal subsampling for composite quantile regression in big data

- Mathematics, Computer Science
- Statistical Papers
- 2022

This work establishes the consistency and asymptotic normality of the CQR estimator from a general subsampling algorithm and derives the optimal subsampling probabilities under the L- and A-optimality criteria.

### Optimal subsampling for large-scale quantile regression

- Mathematics, Computer Science
- J. Complex.
- 2021

### Sampling-based Gaussian Mixture Regression for Big Data

- Mathematics, Computer Science
- Journal of Data Science
- 2022

This paper proposes a nonuniform subsampling method for finite mixtures of regression models to reduce the computational burden of large data, evaluates the proposed method in a simulation study, and presents a real-data example.

## References

Showing 1–10 of 30 references.

### LOCAL CASE-CONTROL SAMPLING: EFFICIENT SUBSAMPLING IN IMBALANCED DATA SETS.

- Mathematics, Computer Science
- Annals of Statistics
- 2014

This work proposes a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme, and shows that this method can substantially outperform standard case-control subsampling.
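The local case-control accept-reject scheme can be sketched in pure Python. All names, constants, and the gradient-ascent solver below are illustrative assumptions: a point (x, y) is accepted with probability |y - p_tilde(x)| under a pilot model, an unweighted logistic fit is run on the accepted sample, and the pilot coefficients are added back as the bias correction.

```python
import math
import random

def sigmoid(z):
    z = max(min(z, 30.0), -30.0)   # clamp to avoid overflow
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=2.0, iters=200):
    """Plain gradient-ascent logistic regression (illustrative solver)."""
    d = len(X[0])
    beta = [0.0] * d
    for _ in range(iters):
        grad = [0.0] * d
        for xi, yi in zip(X, y):
            p = sigmoid(sum(b * x for b, x in zip(beta, xi)))
            for j in range(d):
                grad[j] += (yi - p) * xi[j]
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

# Imbalanced synthetic data (illustrative): roughly 12% positives.
random.seed(3)
n = 10000
X = [[1.0, random.gauss(0.0, 1.0)] for _ in range(n)]
true_beta = [-2.0, 1.0]
y = [1 if random.random() < sigmoid(sum(b * x for b, x in zip(true_beta, xi))) else 0
     for xi in X]

# Pilot fit on a small uniform subsample.
idx0 = random.sample(range(n), 800)
pilot = fit_logistic([X[i] for i in idx0], [y[i] for i in idx0])

# Accept-reject: keep (x, y) with probability |y - p_tilde(x)|, which favours
# points the pilot model finds surprising (minority-class and boundary points).
sample = [(xi, yi) for xi, yi in zip(X, y)
          if random.random() < abs(yi - sigmoid(sum(b * x for b, x in zip(pilot, xi))))]

# Fit the accepted sample unweighted, then add back the pilot coefficients
# (the post-hoc bias correction used by local case-control sampling).
fit = fit_logistic([s[0] for s in sample], [s[1] for s in sample])
beta_hat = [p + f for p, f in zip(pilot, fit)]
```

The accepted sample is roughly class-balanced by construction, which is why the unweighted fit on it is cheap and stable even when the full data are heavily imbalanced.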

### New Subsampling Algorithms for Fast Least Squares Regression

- Computer Science
- NIPS
- 2013

This work proposes three methods that address the big-data problem by subsampling the covariance matrix, using either single- or two-stage estimation of ordinary least squares from large amounts of data, with an error bound of O(√p/n).

### Fast and Robust Least Squares Estimation in Corrupted Linear Models

- Computer Science
- NIPS
- 2014

Under a general model of corrupted observations, this paper shows that the concept of influence can be used to detect such corrupted observations, and the proposed subsampling algorithm improves over current state-of-the-art approximation schemes for ordinary least squares.

### A statistical perspective on algorithmic leveraging

- Computer Science
- J. Mach. Learn. Res.
- 2015

This work provides an effective framework to evaluate the statistical properties of algorithmic leveraging in the context of estimating parameters in a linear regression model and shows that from the statistical perspective of bias and variance, neither leverage-based sampling nor uniform sampling dominates the other.

### Faster least squares approximation

- Computer Science, Mathematics
- Numerische Mathematik
- 2011

This work presents two randomized algorithms that provide accurate relative-error approximations to the optimal value and the solution vector of a least squares approximation problem more rapidly than existing exact algorithms.

### Leveraging for big data regression

- Computer Science
- 2015

Leveraging methods stand as a unique development of their kind in big data analytics, allowing pervasive access to massive amounts of information without resorting to high-performance computing or cloud computing.

### Fast approximation of matrix coherence and statistical leverage

- Computer Science
- ICML
- 2012

A randomized algorithm is proposed that takes as input an arbitrary n × d matrix A, with n ≫ d, and returns, as output, relative-error approximations to all n of the statistical leverage scores.
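For contrast with the randomized approximation above, exact leverage scores h_i = x_iᵀ(XᵀX)⁻¹x_i can be computed directly when d is tiny; the randomized algorithm matters because this direct route costs O(nd²) and the inverse becomes the bottleneck for large d. A small pure-Python illustration on a synthetic n × 2 matrix (closed-form 2×2 inverse; all names and constants are illustrative):

```python
import random

# Exact statistical leverage scores for a tall n x 2 matrix:
# h_i = x_i^T (X^T X)^{-1} x_i, i.e. the diagonal of the hat matrix.
random.seed(1)
n = 1000
X = [[1.0, random.gauss(0.0, 1.0)] for _ in range(n)]

# Form X^T X for d = 2, then invert it in closed form.
a = sum(x[0] * x[0] for x in X)
b = sum(x[0] * x[1] for x in X)
c = b
d = sum(x[1] * x[1] for x in X)
det = a * d - b * c
inv = [[d / det, -b / det], [-c / det, a / det]]

lev = [xi[0] * (inv[0][0] * xi[0] + inv[0][1] * xi[1])
       + xi[1] * (inv[1][0] * xi[0] + inv[1][1] * xi[1]) for xi in X]

# Each h_i lies in [0, 1], and the scores sum to rank(X) = 2.
```

Leverage-based subsampling probabilities are then typically taken proportional to these h_i, so rows that most influence the least-squares fit are sampled more often.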

### A fast randomized algorithm for overdetermined linear least-squares regression

- Mathematics, Computer Science
- Proceedings of the National Academy of Sciences
- 2008

A randomized algorithm for overdetermined linear least-squares regression, based on QR decomposition or bidiagonalization, that computes an n × 1 vector x minimizing the Euclidean norm ‖Ax − b‖ to relative precision ε.

### Low-Rank Approximation and Regression in Input Sparsity Time

- Computer Science
- arXiv
- 2012

We design a new distribution over m × n matrices S so that, for any fixed n × d matrix A of rank r, with probability at least 9/10, ∥SAx∥2 = (1 ± ε)∥Ax∥2 simultaneously for all x ∈ Rd. Here, m is…

### CUR matrix decompositions for improved data analysis

- Computer Science
- Proceedings of the National Academy of Sciences
- 2009

An algorithm is presented that preferentially chooses columns and rows that exhibit high “statistical leverage” and exert a disproportionately large “influence” on the best low-rank fit of the data matrix, obtaining improved relative-error and constant-factor approximation guarantees in worst-case analysis, as opposed to the much coarser additive-error guarantees of prior work.