Sample Selection Bias Correction Theory

@article{Cortes2008SampleSB,
  title={Sample Selection Bias Correction Theory},
  author={Corinna Cortes and Mehryar Mohri and Michael Riley and Afshin Rostamizadeh},
  journal={ArXiv},
  year={2008},
  volume={abs/0805.2775}
}
This paper presents a theoretical analysis of sample selection bias correction. […] We analyze the effect of an error in that estimation on the accuracy of the hypothesis returned by the learning algorithm for two estimation techniques: a cluster-based estimation technique and kernel mean matching. We also report the results of sample bias correction experiments with several data sets using these techniques. Our analysis is based on the novel concept of distributional stability, which generalizes the…
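The correction scheme the abstract refers to reweights each training point by the inverse of its estimated selection probability. Below is a minimal sketch of the cluster-based estimation variant, assuming the biased sample was drawn from a larger unlabeled pool and both are NumPy arrays of shape (n, d); the helper name cluster_selection_weights and the +1 smoothing are illustrative choices, not taken from the paper.

    import numpy as np
    from sklearn.cluster import KMeans

    # Cluster-based selection-probability estimate (illustrative sketch).
    # Assumption: the biased sample X_train was drawn from the larger
    # unlabeled pool X_pool; both are NumPy arrays of shape (n, d).
    def cluster_selection_weights(X_train, X_pool, n_clusters=10, seed=0):
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit(X_pool)
        pool_counts = np.bincount(km.predict(X_pool), minlength=n_clusters)
        train_counts = np.bincount(km.predict(X_train), minlength=n_clusters)
        # Per-cluster selection probability; +1 smoothing avoids division by zero.
        p_sel = (train_counts + 1) / (pool_counts + 1)
        # Importance weight of each training point: 1 / Pr[selected | cluster].
        return 1.0 / p_sel[km.predict(X_train)]

The returned weights plug into any learner that accepts per-example weights, giving the weighted empirical risk (1/n) * sum_i w_i * loss(h(x_i), y_i).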
Effects of sampling skewness of the importance-weighted risk estimator on model selection
TLDR
This work empirically shows that the sampling distribution of an importance-weighted risk estimator can be skewed, and that the resulting over- and underestimates of the risk lead to sub-optimal regularization parameters when used for importance-weighted validation.
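As a toy illustration of the skewness claim, the simulation below (my own assumed setup: 1-D Gaussian covariate shift and a squared stand-in loss, not the paper's experiments) draws the importance-weighted risk estimate many times and measures the skewness of its sampling distribution.

    import numpy as np

    # Toy simulation: the importance-weighted risk estimate is unbiased,
    # yet its sampling distribution over repeated draws is visibly skewed.
    rng = np.random.default_rng(0)
    mu_tr, mu_te = 0.0, 1.0

    def iw_risk(n=50):
        x = rng.normal(mu_tr, 1.0, n)          # biased (training) draws
        # density ratio p_test(x) / p_train(x) for unit-variance Gaussians
        w = np.exp(-(x - mu_te) ** 2 / 2) / np.exp(-(x - mu_tr) ** 2 / 2)
        return np.mean(w * x ** 2)             # importance-weighted risk

    est = np.array([iw_risk() for _ in range(10_000)])
    skew = np.mean(((est - est.mean()) / est.std()) ** 3)
    print(f"mean = {est.mean():.3f}, sample skewness = {skew:.2f}")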
Fair and Robust Classification Under Sample Selection Bias
TLDR
This paper proposes a framework for robust and fair learning under sample selection bias, which adopts the reweighting estimation approach for bias correction and the minimax robust estimation approach for achieving robustness in prediction accuracy.
Correcting Classifiers for Sample Selection Bias in Two-Phase Case-Control Studies
TLDR
This work provides guidance for choosing correction methods when training classifiers on biased samples and proposes two new resampling-based methods that recover the original data and covariance structure: stochastic inverse-probability oversampling and parametric inverse-probability bagging.
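A hedged sketch of the generic inverse-probability oversampling idea follows (not the paper's exact stochastic or parametric variants): resample the biased data with replacement, with probability proportional to one over each case's estimated selection probability.

    import numpy as np

    # Generic inverse-probability oversampling (illustrative). p_sel[i] is
    # the estimated probability that case i entered the biased sample.
    def ip_oversample(X, y, p_sel, n_out=None, seed=0):
        rng = np.random.default_rng(seed)
        w = 1.0 / np.asarray(p_sel)
        idx = rng.choice(len(X), size=n_out or len(X), replace=True,
                         p=w / w.sum())
        return X[idx], y[idx]                  # X, y: NumPy arrays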
Robust Fairness-aware Learning Under Sample Selection Bias
TLDR
A framework for robust and fair learning under sample selection bias is proposed and the fairness is achieved under the worst case, which guarantees the model’s fairness on test data during the minimax optimization.
On reducing sampling variance in covariate shift using control variates
TLDR
It is shown that introducing a control variate can reduce the variance of the importance-weighted risk estimator, which leads to superior regularization parameter estimates when the training data is much smaller in scale than the test data.
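The control-variate construction alluded to here can be sketched as follows, under the standard assumption that the importance weights have expectation one under the training distribution; the variance-optimal coefficient beta and the function name are my own illustrative choices.

    import numpy as np

    # Control-variate correction (assumed standard form): under the
    # training distribution E[w(x)] = 1, so the weights are a control
    # variate with known mean, and the corrected estimator
    #     R_cv = mean(w * loss) - beta * (mean(w) - 1)
    # has lower variance for beta = Cov(w * loss, w) / Var(w).
    def cv_risk(losses, weights):
        wl = weights * losses
        beta = np.cov(wl, weights, ddof=1)[0, 1] / np.var(weights, ddof=1)
        return wl.mean() - beta * (weights.mean() - 1.0)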
Bias Correction for Replacement Samples in Longitudinal Research
TLDR
Four ways to correct the bias introduced by replacement samples are proposed and evaluated: a parametric bootstrapping replacement-sample correction, a non-parametric bootstrapping replacement-sample correction, a primary inverse-probability reweighting correction, and a likelihood-based inverse-probability reweighting correction.
Nearest neighbor density ratio estimation for large-scale applications in astronomy
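The title above refers to estimating the density ratio with nearest-neighbour counts. One common variant (an assumption on my part, not necessarily the paper's exact estimator): among each query point's k nearest neighbours in the pooled sample, the ratio of test-set hits to training-set hits, rescaled by the sample sizes, estimates p_test(x) / p_train(x).

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    # Nearest-neighbour density-ratio sketch (one common variant, assumed).
    def knn_density_ratio(X_query, X_tr, X_te, k=20):
        pooled = np.vstack([X_tr, X_te])
        labels = np.r_[np.zeros(len(X_tr)), np.ones(len(X_te))]
        _, idx = NearestNeighbors(n_neighbors=k).fit(pooled).kneighbors(X_query)
        te_hits = labels[idx].sum(axis=1)      # test-set neighbours per query
        # +1 smoothing keeps the ratio finite when one side has no hits.
        return ((te_hits + 1) / (k - te_hits + 1)) * (len(X_tr) / len(X_te))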
Multi-characteristic Subject Selection from Biased Datasets
TLDR
This paper presents a constrained optimization-based method that finds the best possible sampling fractions for the different population subgroups, based on the desired sampling fractions provided by the researcher running the subject selection.
A characterization of sample selection bias in system evaluation and the case of information retrieval
  • M. Melucci
  • International Journal of Data Science and Analytics
  • 2018
TLDR
The unbiased measure described in this paper rewards systems that perform poorly on difficult tasks, thus providing a better picture of both system efficiency and system ranking.
Domain adaptation and sample bias correction theory and algorithm for regression
...

References

Showing 1-10 of 27 references
Learning and evaluating classifiers under sample selection bias
TLDR
This paper formalizes the sample selection bias problem in machine learning terms and studies analytically and experimentally how a number of well-known classifier learning methods are affected by it.
An improved categorization of classifier's sensitivity on sample selection bias
TLDR
It is argued that whether a classifier learner is affected by sample selection bias depends on the dataset as well as on the heuristics or inductive bias implied by the learning algorithm and their appropriateness to the particular dataset.
Correcting sample selection bias in maximum entropy density estimation
We study the problem of maximum entropy density estimation in the presence of known sample selection bias. We propose three bias correction approaches. The first one takes advantage of unbiased…
Direct importance estimation for covariate shift adaptation
TLDR
This paper proposes a direct importance estimation method that does not involve density estimation and is equipped with a natural cross-validation procedure, so that tuning parameters such as the kernel width can be objectively optimized.
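The method models the importance as a mixture of Gaussian kernels centred at test points and fits it by maximizing the test log-likelihood under a normalization constraint. A much-simplified sketch follows (projected gradient ascent instead of the paper's optimizer, and a fixed kernel width rather than the cross-validated one; inputs are (n, d) NumPy arrays).

    import numpy as np

    # Simplified KLIEP-style importance estimation:
    #   w(x) = sum_l alpha_l * k(x, c_l), kernels centred at test points;
    #   maximize the test log-likelihood subject to alpha >= 0 and the
    #   normalization mean_train[w] = 1.
    def gauss(X, Y, sigma):
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    def kliep(X_tr, X_te, sigma=1.0, lr=1e-3, iters=2000):
        K_te = gauss(X_te, X_te, sigma)        # basis evaluated at test points
        K_tr = gauss(X_tr, X_te, sigma)
        alpha = np.full(K_te.shape[1], 1.0 / K_te.shape[1])
        b = K_tr.mean(axis=0)                  # normalization vector
        for _ in range(iters):
            grad = (K_te / (K_te @ alpha)[:, None]).mean(axis=0)
            alpha = np.maximum(alpha + lr * grad, 0.0)
            alpha /= b @ alpha                 # enforce mean_train[w] = 1
        return K_tr @ alpha                    # importance weights on train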
Sample selection bias as a specification error
This paper discusses the bias that results from using non-randomly selected samples to estimate behavioral relationships as an ordinary specification…
Correcting Sample Selection Bias by Unlabeled Data
TLDR
A nonparametric method is presented that directly produces resampling weights without distribution estimation, working by matching the training and test distributions in feature space.
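This is the kernel mean matching (KMM) approach: choose weights that make the reweighted training mean embedding match the test mean embedding in an RKHS. A rough sketch, solving the box-constrained quadratic program by projected gradient descent and omitting the paper's additional constraint on the mean of the weights:

    import numpy as np

    # Rough KMM sketch: minimize 0.5 * beta' K beta - kappa' beta
    # subject to 0 <= beta <= B, via projected gradient descent.
    def rbf(X, Y, sigma):
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    def kmm(X_tr, X_te, sigma=1.0, B=10.0, iters=5000):
        K = rbf(X_tr, X_tr, sigma)
        kappa = (len(X_tr) / len(X_te)) * rbf(X_tr, X_te, sigma).sum(axis=1)
        lr = 1.0 / (np.linalg.eigvalsh(K).max() + 1e-9)   # stable step size
        beta = np.ones(len(X_tr))
        for _ in range(iters):
            beta = np.clip(beta - lr * (K @ beta - kappa), 0.0, B)
        return beta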
The Foundations of Cost-Sensitive Learning
TLDR
It is argued that changing the balance of negative and positive training examples has little effect on the classifiers produced by standard Bayesian and decision tree learning methods, and that the recommended way of applying one of these methods is to learn a classifier from the training set and then compute optimal decisions explicitly using the probability estimates given by the classifier.
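That recipe amounts to thresholding the classifier's probability estimate at a cost-derived value instead of 0.5. With c_fp the cost of a false positive and c_fn the cost of a false negative (correct decisions assumed cost-free), the optimal threshold is c_fp / (c_fp + c_fn); a minimal sketch:

    # Cost-sensitive decision rule: train any probabilistic classifier,
    # then threshold p(y=1|x) at the cost-derived value rather than 0.5.
    def cost_sensitive_decision(p_pos, c_fp, c_fn):
        return p_pos >= c_fp / (c_fp + c_fn)

    # Example: if a false negative is 9x as costly as a false positive,
    # the threshold drops to 0.1 and far more positives are predicted.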
Ridge Regression Learning Algorithm in Dual Variables
TLDR
A regression estimation algorithm is introduced that combines the dual version of Ridge Regression, applied to the ANOVA enhancement of infinite-node splines, with the use of kernel functions as in Support Vector methods.
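The dual formulation is what is now usually called kernel ridge regression: solve for dual variables alpha = (K + lam * I)^-1 y and predict with f(x) = sum_i alpha_i * k(x_i, x). A minimal sketch with a Gaussian kernel (the paper's ANOVA-spline enhancement is omitted):

    import numpy as np

    # Minimal dual ridge regression (kernel ridge) with a Gaussian kernel.
    def rbf(X, Y, sigma=1.0):
        d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))

    def fit_dual_ridge(X, y, lam=1e-2, sigma=1.0):
        # Dual variables solve the regularized linear system in K.
        alpha = np.linalg.solve(rbf(X, X, sigma) + lam * np.eye(len(X)), y)
        return lambda X_new: rbf(X_new, X, sigma) @ alpha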
On the Influence of the Kernel on the Consistency of Support Vector Machines
TLDR
It is shown that soft margin algorithms with universal kernels are consistent for a large class of classification problems, including certain noisy tasks, provided that the regularization parameter is chosen well.
Cost-sensitive learning by cost-proportionate example weighting
TLDR
Costing, a method based on cost-proportionate rejection sampling and ensemble aggregation, is proposed; it achieves excellent predictive performance on two publicly available datasets while drastically reducing the computation required by other methods.
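Cost-proportionate rejection sampling keeps each example with probability proportional to its cost, so an ordinary classifier trained on the accepted subset minimizes expected cost on the original distribution; costing averages classifiers over repeated draws. A minimal sketch (NumPy-array inputs assumed; the helper name is illustrative):

    import numpy as np

    # Accept example i with probability cost_i / Z, Z an upper bound on
    # the costs (here, the observed maximum).
    def rejection_sample(X, y, costs, seed=0):
        rng = np.random.default_rng(seed)
        costs = np.asarray(costs, dtype=float)
        keep = rng.random(len(X)) < costs / costs.max()
        return X[keep], y[keep]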
...