Corpus ID: 88518017

On the Use of Random Forest for Two-Sample Testing.

  • Simon Hediger, Loris Michel, Jeffrey Naf
  • arXiv: Methodology
We follow the line of work that uses classifiers for two-sample testing and propose several tests based on the Random Forest classifier. The developed tests are easy to use, require no tuning, and are applicable to any distribution on $\mathbb{R}^p$, even in high dimensions. We provide a comprehensive treatment of the use of classification for two-sample testing, derive the distribution of our tests under the null, and provide a power analysis, both in theory and with simulations. To simplify the use of…
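As a rough illustration of the general classifier-based idea described in the abstract (not the authors' Random Forest tests, whose details are not given here), the following sketch turns held-out classification accuracy into a p-value. It assumes that, under the null, each held-out label is equally likely to be either sample and independent of the features, so the number of correct predictions follows a Binomial(n, 1/2) distribution. The function name `binomial_accuracy_pvalue` is hypothetical.

```python
from math import comb


def binomial_accuracy_pvalue(n_correct: int, n_test: int) -> float:
    """One-sided p-value for H0: both samples come from the same distribution.

    Assumption (not from the source): labels on the held-out set are
    equally likely and independent of the features under H0, so the number
    of correct predictions is Binomial(n_test, 1/2).
    """
    if not 0 <= n_correct <= n_test:
        raise ValueError("n_correct must lie between 0 and n_test")
    # Exact upper tail P(X >= n_correct) for X ~ Binomial(n_test, 1/2).
    tail = sum(comb(n_test, k) for k in range(n_correct, n_test + 1))
    return tail / 2 ** n_test


if __name__ == "__main__":
    # Example: a classifier labels 70 of 100 held-out points correctly.
    print(binomial_accuracy_pvalue(70, 100))
```

A small p-value indicates accuracy that would be unlikely under chance-level guessing, which is evidence that the two samples differ. Any classifier, including a random forest, could supply `n_correct`.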
PKLM: A flexible MCAR test using Classification
We develop a fully non-parametric, fast, easy-to-use, and powerful test for the missing completely at random (MCAR) assumption on the missingness mechanism of a data set. The test compares…
Global and local two-sample tests via regression
Two-sample testing is a fundamental problem in statistics. Despite its long history, there has been renewed interest in this problem with the advent of high-dimensional and complex data.
A Fast and Effective Large-Scale Two-Sample Test Based on Kernels
  • Hoseung Song, Hao Chen
  • Mathematics
  • 2021
Kernel two-sample tests have been widely used and the development of efficient methods for high-dimensional large-scale data is gaining more and more attention as we are entering the big data era.
Local Two-Sample Testing over Graphs and Point-Clouds by Random-Walk Distributions.
Two-sample testing is a fundamental tool for scientific discovery. Yet, aside from concluding that two samples do not come from the same probability distribution, it is often of interest to…
High Probability Lower Bounds for the Total Variation Distance
The statistics and machine learning communities have recently seen a growing interest in classification-based approaches to two-sample testing (e.g. Kim et al. [2016]; Rosenblatt et al. [2016];…
WMW-A: Rank-based two-sample independent test for small sample sizes through an auxiliary sample
The extensive simulation experiments and real applications on microarray gene expression data sets show that the WMW-A test could significantly improve the test power for the two-sample problem with small sample sizes, using either available unlabelled auxiliary data or generated auxiliary data.
Optimizing the synthesis of clinical trial data using sequential trees
The optimization approach presented in this study gives a reliable way to synthesize high-utility clinical trial datasets by evaluating the variability in the utility of synthetic clinical trial data as variable order is randomly shuffled and implemented to find a good order if variability is too high.
Evaluating the utility of synthetic COVID-19 case data
A gradient boosted classification tree was built to predict death using Ontario’s 90 514 COVID-19 case records linked with community comorbidity, demographic, and socioeconomic characteristics and could be used as a proxy for the real dataset.
Applying Kernel Change Point Detection to Financial Markets
Test for non-negligible adverse shifts
This work proposes a framework to detect adverse shifts based on outlier scores, D-SOS, which is uniquely tailored to serve as a robust metric for model monitoring and data validation.


Classification Accuracy as a Proxy for Two Sample Testing
This work proves two results that hold for all classifiers in any dimension: if a classifier's true error remains $\epsilon$-better than chance for some $\epsilon>0$ as $d,n \to \infty$, then (a) the permutation-based test is consistent (has power approaching one), and (b) a computationally efficient test based on a Gaussian approximation of the null distribution is also consistent.
Revisiting Classifier Two-Sample Tests
The properties, performance, and uses of C2ST are established, their main theoretical properties are analyzed, and their use to evaluate the sample quality of generative models with intractable likelihoods, such as Generative Adversarial Networks, is proposed.
Consistency of Random Forests and Other Averaging Classifiers
A number of theorems are given that establish the universal consistency of averaging rules, and it is shown that some popular classifiers, including one suggested by Breiman, are not universally consistent.
Optimal kernel choice for large-scale two-sample tests
The new kernel selection approach yields a more powerful test than earlier kernel selection heuristics, and makes the kernel selection and test procedures suited to data streams, where the observations cannot all be stored in memory.
A Kernel Two-Sample Test
This work proposes a framework for analyzing and comparing distributions, which is used to construct statistical tests to determine if two samples are drawn from different distributions, and presents two distribution-free tests based on large deviation bounds for the maximum mean discrepancy (MMD).
Fast Two-Sample Testing with Analytic Representations of Probability Measures
A class of nonparametric two-sample tests with a cost linear in the sample size based on an ensemble of distances between analytic functions representing each of the distributions that give a better power/time tradeoff than competing approaches and in some cases better outright power than even the most expensive quadratic-time tests.
An Empirical Study of Learning from Imbalanced Data Using Random Forest
A comprehensive suite of experiments that analyze the performance of the random forest (RF) learner implemented in Weka is discussed, providing an extensive empirical evaluation of RF learners built from imbalanced data.
B-test: A Non-parametric, Low Variance Kernel Two-sample Test
The B-test uses a smaller than quadratic number of kernel evaluations and avoids completely the computational burden of complex null-hypothesis approximation while maintaining consistency and probabilistically conservative thresholds on Type I error.
Random Forests
  • L. Breiman
  • Mathematics, Computer Science
  • Machine Learning
  • 2004
Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the forest, and are also applicable to regression.
Do we need hundreds of classifiers to solve real world classification problems?
The random forest is clearly the best family of classifiers (3 out of the 5 best classifiers are RF), followed by SVM (4 classifiers in the top-10), neural networks and boosting ensembles (5 and 3 members in the top-20, respectively).