# Nonuniform Negative Sampling and Log Odds Correction with Rare Events Data

@article{Wang2021NonuniformNS, title={Nonuniform Negative Sampling and Log Odds Correction with Rare Events Data}, author={HaiYing Wang and Aonan Zhang and Chong Wang}, journal={ArXiv}, year={2021}, volume={abs/2110.13048} }

We investigate the issue of parameter estimation with nonuniform negative sampling for imbalanced data. We first prove that, with imbalanced data, the available information about unknown parameters is only tied to the relatively small number of positive instances, which justifies the usage of negative sampling. However, if the negative instances are subsampled to the same level of the positive cases, there is information loss. To maintain more information, we derive the asymptotic distribution…
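The abstract is truncated here, so the following is only a minimal sketch of the general idea behind a log odds correction, not the authors' estimator. It assumes a logistic model, keeps every positive, and keeps each negative with a known probability `pi(x)`. Under those assumptions the log odds in the subsample equal the population log odds minus `log pi(x)`, so fitting with the offset `-log pi(x)` targets the original coefficients. The sampling rule, the simulated data, and all names (`fit_logistic_with_offset`, `pi`, `beta_true`) are hypothetical choices made for illustration.

```python
import numpy as np


def fit_logistic_with_offset(X, y, offset, max_iter=100, tol=1e-8):
    """Damped Newton-Raphson for logistic regression with a fixed per-row offset."""

    def log_lik(b):
        eta = X @ b + offset
        return np.sum(y * eta - np.logaddexp(0.0, eta))

    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        eta = X @ beta + offset
        p = 1.0 / (1.0 + np.exp(-eta))
        grad = X.T @ (y - p)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, grad)
        # Halve the step until the log-likelihood does not decrease.
        current, t = log_lik(beta), 1.0
        while log_lik(beta + t * step) < current and t > 1e-8:
            t *= 0.5
        beta = beta + t * step
        if np.max(np.abs(t * step)) < tol:
            break
    return beta


# Simulated rare-events data (illustrative values only).
rng = np.random.default_rng(0)
n = 200_000
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
beta_true = np.array([-5.0, 1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ beta_true))))

# Hypothetical nonuniform sampling rule for negatives; positives are all kept.
pi = np.clip(0.05 * (1.0 + np.abs(x)), 0.0, 1.0)
keep = (y == 1) | (rng.random(n) < pi)

# Log odds correction: subsample log odds = x'beta - log pi(x).
beta_hat = fit_logistic_with_offset(X[keep], y[keep], -np.log(pi[keep]))

# Ignoring the sampling (zero offset) distorts the fit, most visibly the intercept.
beta_naive = fit_logistic_with_offset(X[keep], y[keep], np.zeros(keep.sum()))

print("corrected:", beta_hat)
print("naive:    ", beta_naive)
```

In this simulation the corrected fit should land near `beta_true`, while the uncorrected intercept is shifted upward because negatives are under-represented in the subsample.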

## 3 Citations

### Maximum sampled conditional likelihood for informative subsampling

- Mathematics · ArXiv
- 2020

The asymptotic normality of the MSCLE is established, and its asymptotic variance-covariance matrix is proved to be the smallest among a class of asymptotically unbiased estimators, including the inverse probability weighted estimator.

### A note on centering in subsample selection for linear regression

- Mathematics · Stat
- 2022

Centering is a commonly used technique in linear regression analysis. With centered data on both the responses and covariates, the ordinary least squares estimator of the slope parameter can be…

### Monolith: Real Time Recommendation System With Collisionless Embedding Table

- Computer Science · ArXiv
- 2022

This paper presents Monolith, a system tailored for online training that uses a collisionless embedding table, with optimizations such as expirable embeddings and frequency filtering to reduce its memory footprint, and proved that system reliability could be traded off for real-time learning.

## References

### Local Case-Control Sampling: Efficient Subsampling in Imbalanced Data Sets

- Mathematics, Computer Science · Annals of Statistics
- 2014

This work proposes a method for efficient subsampling in logistic regression that adjusts the class balance locally in feature space via an accept-reject scheme, and shows that this method can substantially outperform standard case-control subsampling.
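The accept-reject scheme summarized above can be sketched as follows. This is a minimal illustration on simulated data, not the paper's reference implementation: a pilot fit gives a rough probability for each point, each point is accepted with probability `|y - p_pilot(x)|`, a logistic regression is fitted on the accepted points, and the pilot coefficients are added back. The pilot choice (a uniform random subsample), the data, and all names (`fit_logistic`, `theta_pilot`, `theta_lcc`) are assumptions made for the example.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, n_iter=30, tol=1e-10):
    """Plain Newton-Raphson for an unpenalized logistic regression."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = sigmoid(X @ beta)
        hess = (X * (p * (1.0 - p))[:, None]).T @ X
        step = np.linalg.solve(hess, X.T @ (y - p))
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Simulated imbalanced data (illustrative values only).
rng = np.random.default_rng(1)
n = 200_000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
theta_true = np.array([-5.0, 1.0])
y = rng.binomial(1, sigmoid(X @ theta_true)).astype(float)

# Step 1: pilot estimate from a small uniform subsample (an assumed choice).
pilot_idx = rng.choice(n, size=20_000, replace=False)
theta_pilot = fit_logistic(X[pilot_idx], y[pilot_idx])

# Step 2: accept each point with probability |y - p_pilot(x)|, which favors
# points the pilot model finds surprising.
accept_prob = np.abs(y - sigmoid(X @ theta_pilot))
keep = rng.random(n) < accept_prob

# Step 3: fit on the accepted points, then add the pilot back. In the accepted
# sample the log odds are x'(theta - theta_pilot), hence the additive correction.
theta_lcc = fit_logistic(X[keep], y[keep]) + theta_pilot

print("kept", int(keep.sum()), "of", n, "points")
print("estimate:", theta_lcc)
```

The accepted sample is far smaller than the full data and roughly balanced locally, which is why a plain Newton iteration started at zero is adequate for the second fit in this sketch.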

### Logistic Regression for Massive Data with Rare Events

- Mathematics · ICML
- 2020

It is proved that, when only a small proportion of the nonevents is under-sampled, the resulting under-sampled estimator may have the same asymptotic distribution as the full-data MLE. This demonstrates the advantage of under-sampling nonevents for rare events data, because the procedure may significantly reduce computation and/or data collection costs.

### More Efficient Estimation for Logistic Regression with Optimal Subsamples

- Mathematics · J. Mach. Learn. Res.
- 2019

This paper proposes a more efficient estimator based on the OSMAC subsample without weighting the likelihood function, and develops a new algorithm based on Poisson sampling, which does not require approximating the optimal subsampling probabilities all at once.

### Local Uncertainty Sampling for Large-Scale Multi-Class Logistic Regression

- Computer Science · The Annals of Statistics
- 2020

It is shown that, asymptotically, the proposed method always achieves a smaller variance than uniform random sampling, and that significant improvement over uniform sampling can be achieved when the classes are conditionally imbalanced.

### Optimal subsampling for quantile regression in big data

- Mathematics · Biometrika
- 2020

We investigate optimal subsampling for quantile regression. We derive the asymptotic distribution of a general subsampling estimator and then derive two versions of optimal subsampling…

### Logistic Regression in Rare Events Data

- Political Science · Political Analysis
- 2001

It is shown that more efficient sampling designs exist for making valid inferences, such as sampling all available events and a tiny fraction of nonevents, which enables scholars to save as much as 99% of their (nonfixed) data collection costs or to collect much more meaningful explanatory variables.

### Optimal Subsampling for Large Sample Logistic Regression

- Mathematics, Computer Science · Journal of the American Statistical Association
- 2018

A two-step algorithm is developed to efficiently approximate the maximum likelihood estimate in logistic regression, and optimal subsampling probabilities are derived that minimize the asymptotic mean squared error of the resultant estimator.

### Optimal Distributed Subsampling for Maximum Quasi-Likelihood Estimators With Massive Data

- Mathematics, Computer Science · Journal of the American Statistical Association
- 2020

This article derives optimal Poisson subsampling probabilities in the context of quasi-likelihood estimation under the A- and L-optimality criteria, and establishes the consistency and asymptotic normality of the resultant estimators.

### Bootstrap consistency for general semiparametric $M$-estimation

- Mathematics, Economics
- 2009

Consider $M$-estimation in a semiparametric model that is characterized by a Euclidean parameter of interest and an infinite-dimensional nuisance parameter. As a general purpose approach to…

### Optimal Subsampling with Influence Functions

- Computer Science, Mathematics · NeurIPS
- 2018

For linear regression models, which have well-studied procedures for non-uniform subsampling, the optimal influence function based method outperforms previous approaches even when using approximations to the optimal probabilities.