author={William Fithian and Trevor J. Hastie},
  journal={Annals of Statistics},
  volume={42},
  number={5},
For classification problems with significant class imbalance, subsampling can reduce computational costs at the price of inflated variance in estimating model parameters. We propose a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme. Our method generalizes standard case-control sampling, using a pilot estimate to preferentially select examples whose responses are conditionally rare given their features… 
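The accept-reject scheme described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code. It assumes the acceptance probability |y − p̃(x)|, where p̃ is the pilot model's fitted probability, so that points the pilot finds "surprising" are kept more often. The helper names (`sigmoid`, `fit_logistic`, `local_case_control`) and the simulated data are hypothetical. Because the subsample's log odds equal the full-data log odds minus the pilot's log odds, the pilot coefficients are added back after fitting.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=25, ridge=1e-8):
    """Unweighted logistic regression by Newton's method.

    X is assumed to already contain an intercept column.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)
        # Tiny ridge term only guards against a numerically singular Hessian.
        hess = (X * (p * (1.0 - p))[:, None]).T @ X + ridge * np.eye(X.shape[1])
        theta = theta + np.linalg.solve(hess, grad)
    return theta

def local_case_control(X, y, theta_pilot, rng):
    """Subsample with acceptance probability |y - p_pilot(x)|, then refit."""
    p_pilot = sigmoid(X @ theta_pilot)
    accept_prob = np.abs(y - p_pilot)        # high when the label is conditionally rare
    keep = rng.random(len(y)) < accept_prob  # independent accept-reject draw per point
    theta_sub = fit_logistic(X[keep], y[keep])
    # The subsample fit estimates (theta - theta_pilot); add the pilot back.
    return theta_sub + theta_pilot, keep

# Illustrative run on simulated imbalanced data (all values are assumptions).
rng = np.random.default_rng(0)
n = 100_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_true = np.array([-4.0, 1.5])
y = (rng.random(n) < sigmoid(X @ theta_true)).astype(float)

# Pilot estimate from a uniform random subsample.
pilot_idx = rng.choice(n, size=20_000, replace=False)
theta_pilot = fit_logistic(X[pilot_idx], y[pilot_idx])

theta_hat, keep = local_case_control(X, y, theta_pilot, rng)
print("fraction of data kept:", keep.mean())
print("estimate:", theta_hat, "truth:", theta_true)
```

When the pilot is accurate, the retained subsample is roughly balanced locally in feature space, which is why the refit needs no weights, only the additive correction.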

Nonuniform Negative Sampling and Log Odds Correction with Rare Events Data

This work proves that, with imbalanced data, the available information about unknown parameters is tied only to the relatively small number of positive instances, which justifies the use of negative sampling. It also derives the asymptotic distribution of a general inverse probability weighted (IPW) estimator and obtains the optimal sampling probability that minimizes its variance.
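The "log odds correction" named in this title can be illustrated with a standard Bayes-rule calculation. This is a sketch under assumed notation, not necessarily the paper's exact estimator: suppose every positive is kept and a negative with features $x$ is kept with probability $\rho(x)$, where $\rho$ and the selection indicator $S$ are symbols introduced here for illustration and $p(x) = P(y=1 \mid x)$.

```latex
\begin{align*}
\frac{P(y=1 \mid x, S=1)}{P(y=0 \mid x, S=1)}
  &= \frac{p(x)\cdot 1}{\bigl(1-p(x)\bigr)\,\rho(x)}, \\
\operatorname{logit} P(y=1 \mid x, S=1)
  &= \operatorname{logit} p(x) - \log \rho(x).
\end{align*}
```

A model fit on the subsample therefore estimates the full-data log odds shifted by $-\log\rho(x)$, and adding $\log\rho(x)$ back recovers them. With a constant $\rho$ the shift affects only the intercept.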

Efficient posterior sampling for high-dimensional imbalanced logistic regression.

Classification with high-dimensional data is of widespread interest and often involves dealing with imbalanced data. Bayesian classification approaches are hampered by the fact that current Markov

Optimal subsampling for linear quantile regression models

Subsampling techniques are efficient methods for handling big data. Quite a few optimal sampling methods have been developed for parametric models in which the loss functions are differentiable with

Optimal Subsampling for Large Sample Logistic Regression

A two-step algorithm is developed to efficiently approximate the maximum likelihood estimate in logistic regression, and optimal subsampling probabilities are derived that minimize the asymptotic mean squared error of the resulting estimator.

Local Uncertainty Sampling for Large-Scale Multi-Class Logistic Regression

It is shown that, asymptotically, the proposed method always achieves a smaller variance than uniform random sampling, and that significant improvement over uniform sampling can be achieved when the classes are conditionally imbalanced.

Less Is Better: Unweighted Data Subsampling via Influence Function

This work proposes a novel Unweighted Influence Data Subsampling (UIDS) method, and proves that the subset-model acquired through the method can outperform the full-set-model.

Surprise sampling: Improving and extending the local case-control sampling

This work proposes a more general sampling scheme based on the working principle that data points deserve a higher sampling probability if they contain more information or appear "surprising", for example because of a large pilot prediction error or a large absolute score.

Unweighted estimation based on optimal sample under measurement constraints

To tackle massive data, subsampling is a practical approach to select the more informative data points. However, when responses are expensive to measure, developing efficient subsampling schemes is

More Efficient Estimation for Logistic Regression with Optimal Subsamples

This paper proposes a more efficient estimator based on an OSMAC subsample without weighting the likelihood function, and develops a new algorithm based on Poisson sampling that does not require approximating the optimal subsampling probabilities all at once.



The design and analysis of case-control studies with biased sampling.

A design is proposed for case-control studies in which selection of subjects for full variable ascertainment is based jointly on disease status and on easily obtained "screening" variables that may

Logistic regression methods for retrospective case-control studies using complex sampling procedures.

This work considers the case-control problem with stratified samples and assumes a logistic model that does not include terms for strata, i.e., for fixed covariates the (prospective) probability of disease does not depend on stratum, and obtains the maximum likelihood estimators for all parameters in the logistic model.

Logistic regression for two-stage case-control data

Samples of diseased cases and nondiseased controls are drawn at random from the population at risk. After classification according to the exposure of interest, subsamples of cases and

Infinitely Imbalanced Logistic Regression

  • A. Owen
  • Mathematics
  • J. Mach. Learn. Res.
  • 2007
The infinitely imbalanced case where one class has a finite sample size and the other class's sample size grows without bound is considered, which suggests a computational shortcut for fraud detection problems.

Fitting Logistic Regression Models in Stratified Case-Control Studies

Methods are developed for fitting logistic models to data in which cases and/or controls are sampled from the available cases and controls within population strata. Particular attention is

Special Invited Paper-Additive logistic regression: A statistical view of boosting

This work shows that this seemingly mysterious phenomenon of boosting can be understood in terms of well-known statistical principles, namely additive modeling and maximum likelihood, and develops more direct approximations and shows that they exhibit nearly identical results to boosting.

A decision-theoretic generalization of on-line learning and an application to boosting

The model studied can be interpreted as a broad, abstract extension of the well-studied on-line prediction model to a general decision-theoretic setting, and it is shown that the multiplicative weight-update Littlestone–Warmuth rule can be adapted to this model, yielding bounds that are slightly weaker in some cases, but applicable to a considerably more general class of learning problems.

A Generalization of Sampling Without Replacement from a Finite Universe

This paper presents a general technique for the treatment of samples drawn without replacement from finite universes when unequal selection probabilities are used. Two sampling schemes are

On the robustness of weighted methods for fitting models to case–control data

We compare the robustness under model misspecification of two approaches to fitting logistic regression models with unmatched case–control data. One is the standard survey approach based on

Logistic disease incidence models and case-control studies

The probability of disease development in a defined time period is described by a logistic regression model. A model for the regression variable, given disease status, is induced and is