Corpus ID: 239768580

Nonuniform Negative Sampling and Log Odds Correction with Rare Events Data

@article{Wang2021NonuniformNS,
  title={Nonuniform Negative Sampling and Log Odds Correction with Rare Events Data},
  author={HaiYing Wang and Aonan Zhang and Chong Wang},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.13048}
}
We investigate the issue of parameter estimation with nonuniform negative sampling for imbalanced data. We first prove that, with imbalanced data, the available information about unknown parameters is tied only to the relatively small number of positive instances, which justifies the use of negative sampling. However, if the negative instances are subsampled to the same level as the positive cases, there is information loss. To maintain more information, we derive the asymptotic distribution… 
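The abstract's idea — keep all positives, subsample negatives nonuniformly, then correct the log odds for the sampling — can be sketched in a few lines. This is a minimal illustration, not the authors' exact procedure: the simulated data, the pilot probabilities (here the true probabilities stand in for a pilot estimate), and the sampling rule are all assumptions.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Simulate rare-events data: roughly 2-3% positives.
n, d = 100_000, 3
X = rng.normal(size=(n, d))
beta_true = np.array([0.5, -0.5, 1.0]); b0_true = -4.0
p = 1 / (1 + np.exp(-(b0_true + X @ beta_true)))
y = rng.binomial(1, p)

# Keep every positive; accept negative i with probability pi_i proportional
# to a pilot estimate of P(y=1|x), targeting about 5 negatives per positive.
c = 5 * y.sum() / p[y == 0].sum()
pi = np.where(y == 1, 1.0, np.minimum(1.0, c * p))
keep = rng.uniform(size=n) < pi
Xs, ys, off = X[keep], y[keep], -np.log(pi[keep])

# Log-odds correction: conditional on being sampled,
# logit P(y=1|x) = b0 + x'beta - log(pi_i), so fit with offset -log(pi_i).
Xa = np.column_stack([np.ones(len(ys)), Xs])
def nll_grad(th):
    eta = Xa @ th + off
    mu = 1 / (1 + np.exp(-eta))
    return np.sum(np.logaddexp(0.0, eta) - ys * eta), Xa.T @ (mu - ys)

theta_hat = minimize(nll_grad, np.zeros(d + 1), jac=True,
                     method="L-BFGS-B").x   # estimates (b0, beta)
```

Because the offset absorbs the sampling distortion, `theta_hat` estimates the original intercept and slopes on roughly one sixth of the data.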

Citations

Maximum sampled conditional likelihood for informative subsampling

TLDR
The asymptotic normality of the MSCLE is established, and it is proved that its asymptotic variance-covariance matrix is the smallest among a class of asymptotically unbiased estimators, including the inverse probability weighted estimator.

Monolith: Real Time Recommendation System With Collisionless Embedding Table

TLDR
This paper presents Monolith, a system tailored for online training, which features a collisionless embedding table with optimizations such as expirable embeddings and frequency filtering to reduce its memory footprint, and demonstrates that system reliability can be traded off for real-time learning.

References

SHOWING 1-10 OF 50 REFERENCES

Local Case-Control Sampling: Efficient Subsampling in Imbalanced Data Sets

TLDR
This work proposes a method for subsampling efficiently for logistic regression by adjusting the class balance locally in feature space via an accept-reject scheme, and shows that this method can substantially outperform standard case-control subsampling.
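The accept-reject scheme summarized above can be sketched concretely. In local case-control sampling, a point (x, y) is accepted with probability |y − p̃(x)| for a pilot estimate p̃; on the accepted sample the log odds shift by −logit(p̃(x)), so one can fit a plain logistic regression and add back the pilot coefficients. The simulated data and the perturbed pilot below are illustrative assumptions, not the paper's experiments.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, d = 50_000, 2
X = rng.normal(size=(n, d))
beta_true = np.array([1.0, -1.0]); b0_true = -3.0
p = 1 / (1 + np.exp(-(b0_true + X @ beta_true)))
y = rng.binomial(1, p)

# Pilot model p~(x): here a perturbation of the truth (an assumption).
b0_pilot, beta_pilot = b0_true + 0.3, beta_true * 0.9
p_tilde = 1 / (1 + np.exp(-(b0_pilot + X @ beta_pilot)))

# Accept-reject step: keep each point with probability |y - p_tilde(x)|.
accept = rng.uniform(size=n) < np.abs(y - p_tilde)
Xs, ys = X[accept], y[accept]

# On the accepted sample, logit P(y=1|x) = (b0 - b0_pilot) + x'(beta - beta_pilot),
# so a plain logistic fit estimates the *difference* from the pilot.
Xa = np.column_stack([np.ones(len(ys)), Xs])
def nll_grad(th):
    eta = Xa @ th
    mu = 1 / (1 + np.exp(-eta))
    return np.sum(np.logaddexp(0.0, eta) - ys * eta), Xa.T @ (mu - ys)

th_s = minimize(nll_grad, np.zeros(d + 1), jac=True, method="L-BFGS-B").x
theta_hat = th_s + np.concatenate([[b0_pilot], beta_pilot])  # add pilot back
```

The accepted sample is roughly class-balanced even though the full data are heavily imbalanced, which is the source of the method's efficiency gain.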

Logistic Regression for Massive Data with Rare Events

TLDR
It is proved that when undersampling a small proportion of the nonevents, the resulting undersampled estimator may have an asymptotic distribution identical to that of the full-data MLE, demonstrating the advantage of undersampling nonevents for rare events data, because this procedure may significantly reduce computation and/or data collection costs.

More Efficient Estimation for Logistic Regression with Optimal Subsamples

TLDR
This paper proposes a more efficient estimator based on an OSMAC subsample without weighting the likelihood function, and develops a new algorithm based on Poisson sampling, which does not require approximating the optimal subsampling probabilities all at once.

Local Uncertainty Sampling for Large-Scale Multi-Class Logistic Regression

TLDR
It is shown that, asymptotically, the proposed method always achieves a smaller variance than uniform random sampling, and that when the classes are conditionally imbalanced, significant improvement over uniform sampling can be achieved.

Optimal subsampling for quantile regression in big data

We investigate optimal subsampling for quantile regression. We derive the asymptotic distribution of a general subsampling estimator and then derive two versions of optimal subsampling

Logistic Regression in Rare Events Data

TLDR
It is shown that more efficient sampling designs exist for making valid inferences, such as sampling all available events and a tiny fraction of nonevents, which enables scholars to save as much as 99% of their (nonfixed) data collection costs or to collect much more meaningful explanatory variables.

Optimal Subsampling for Large Sample Logistic Regression

TLDR
A two-step algorithm is developed to efficiently approximate the maximum likelihood estimate in logistic regression and derive optimal subsampling probabilities that minimize the asymptotic mean squared error of the resultant estimator.

Optimal Distributed Subsampling for Maximum Quasi-Likelihood Estimators With Massive Data

TLDR
This article derives optimal Poisson subsampling probabilities in the context of quasi-likelihood estimation under the A- and L-optimality criteria, and establishes the consistency and asymptotic normality of the resultant estimators.

Bootstrap consistency for general semiparametric $M$-estimation

Consider $M$-estimation in a semiparametric model that is characterized by a Euclidean parameter of interest and an infinite-dimensional nuisance parameter. As a general purpose approach to

Optimal Subsampling with Influence Functions

TLDR
For linear regression models, which have well-studied procedures for non-uniform subsampling, the optimal influence function based method outperforms previous approaches even when using approximations to the optimal probabilities.