Surprise sampling: Improving and extending the local case-control sampling

Xinwei Shen, Kani Chen, Wen Yu
arXiv: Methodology
Fithian and Hastie (2014) proposed a new sampling scheme called local case-control (LCC) sampling that achieves stability and efficiency by using a clever adjustment specific to the logistic model. It is particularly useful for classification with large and imbalanced data. This paper proposes a more general sampling scheme based on the working principle that data points deserve a higher sampling probability if they contain more information or appear "surprising" in the sense of, for example…

Nearly optimal capture-recapture sampling and empirical likelihood weighting estimation for M-estimation with big data

The ELW method overcomes the instability of IPW by circumventing the use of inverse probabilities, and utilizes auxiliary information including the size and certain sample moments of big data, leading to more efficient optimal sampling plans and more economical sample sizes for a prespecified estimation precision.

Maximum sampled conditional likelihood for informative subsampling

The asymptotic normality of the MSCLE is established, and it is proved that its asymptotic variance-covariance matrix is the smallest among a class of asymptotically unbiased estimators, including the inverse probability weighted estimator.




This work proposes a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme, and shows that this method can substantially outperform standard case-control subsampling.
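
The accept-reject scheme summarized above can be illustrated with a minimal Python sketch. It assumes the usual local case-control recipe: a pilot estimate gives a fitted probability p̃(x), each point is kept with probability |y − p̃(x)|, and the pilot coefficients are added back to the subsample fit. The pilot here comes from a small uniform subsample, which is a simplifying assumption of this sketch. The function names, simulated data, and parameter values are all hypothetical and are not taken from the paper.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson logistic regression; X must already contain an intercept column."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        # Tiny ridge term only for numerical stability of the solve.
        theta = theta + np.linalg.solve(hess + 1e-8 * np.eye(X.shape[1]), grad)
    return theta

def lcc_accept(X, y, theta_pilot, rng):
    """Boolean mask: keep each point with probability |y - p_pilot(x)|."""
    p_pilot = sigmoid(X @ theta_pilot)
    return rng.random(len(y)) < np.abs(y - p_pilot)

# Simulated imbalanced data (hypothetical setup, for illustration only).
rng = np.random.default_rng(0)
n = 200_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_true = np.array([-4.0, 1.0])
y = (rng.random(n) < sigmoid(X @ theta_true)).astype(float)

# Step 1: pilot fit on a small uniform subsample (assumption of this sketch).
pilot_idx = rng.choice(n, size=5000, replace=False)
theta_pilot = fit_logistic(X[pilot_idx], y[pilot_idx])

# Step 2: accept-reject pass over the full data.
keep = lcc_accept(X, y, theta_pilot, rng)

# Step 3: fit on the accepted points, then add the pilot back as a correction.
theta_sub = fit_logistic(X[keep], y[keep])
theta_hat = theta_sub + theta_pilot

print("subsample size:", int(keep.sum()), "of", n)
print("estimate:", theta_hat)
```

Points whose labels the pilot already predicts well are rarely kept, so the retained subsample is much smaller than the full data and concentrates on the "surprising" observations.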

A statistical perspective on algorithmic leveraging

This work provides an effective framework to evaluate the statistical properties of algorithmic leveraging in the context of estimating parameters in a linear regression model and shows that from the statistical perspective of bias and variance, neither leverage-based sampling nor uniform sampling dominates the other.

Information-Based Optimal Subdata Selection for Big Data Linear Regression

Theoretical results and extensive simulations demonstrate that the IBOSS approach is superior to subsampling-based methods, sometimes by orders of magnitude, and the advantages of the new approach are also illustrated through analysis of real data.

Optimal Subsampling Algorithms for Big Data Generalized Linear Models

The scope of the OSMAC framework is extended to include generalized linear models with canonical link functions. The consistency and asymptotic normality of the estimator from a general subsampling algorithm are established, and optimal subsampling probabilities under the A- and L-optimality criteria are derived.

Optimal Subsampling for Large Sample Logistic Regression

A two-step algorithm is developed to efficiently approximate the maximum likelihood estimate in logistic regression and derive optimal subsampling probabilities that minimize the asymptotic mean squared error of the resultant estimator.

Generalized case–cohort sampling

A class of cohort sampling designs, including nested case–control, case–cohort and classical case–control designs involving survival data, is studied through a unified approach using Cox’s

End-point Sampling

This paper proposes a new retrospective sampling design, called end-point sampling, which improves the efficiency of the case-cohort and case-control designs, and the regression analysis is conducted using the Cox model.

A Survey of Predictive Modelling under Imbalanced Distributions

The main challenges raised by imbalanced distributions are discussed, the main approaches to these problems are described, a taxonomy of these methods is proposed and some related problems within predictive modelling are referred to.

Estimability and estimation in case-referent studies.

The concepts that case-referent studies provide for the estimation of "relative risk" only if the illness is "rare", and that the rates and risks themselves are inestimable, are overly superficial

Case-cohort and case-control analysis with Cox's model

Prentice (1986) proposed the case-cohort design and studied a pseudolikelihood estimator of regression parameters in Cox's model. We derive a class of estimating equations for case-cohort sampling,