• Corpus ID: 88522386

What is the distribution of the number of unique original items in a bootstrap sample

Alex F. Mendelson, Maria A. Zuluaga, Brian F. Hutton, Sébastien Ourselin
arXiv: Machine Learning
Sampling with replacement occurs in many settings in machine learning, notably in the bagging ensemble technique and the .632+ validation scheme. The number of unique original items in a bootstrap sample can have an important role in the behaviour of prediction models learned on it. Indeed, there are uncontrived examples where duplicate items have no effect. The purpose of this report is to present the distribution of the number of unique original items in a bootstrap sample clearly and… 
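
To make the quantity in the abstract concrete, here is a minimal sketch that computes the distribution of the number of unique original items when n items are drawn with replacement from n items, and checks it against a simulation. It assumes the classical occupancy result P(K = k) = S(n, k) n! / ((n - k)! n^n), where S(n, k) is a Stirling number of the second kind; whether the report presents the distribution in exactly this form is an assumption, and the function names are hypothetical.

```python
import random
from fractions import Fraction
from math import comb, factorial


def stirling2(n, k):
    """Stirling number of the second kind, by inclusion-exclusion."""
    total = sum((-1) ** j * comb(k, j) * (k - j) ** n for j in range(k + 1))
    return total // factorial(k)  # the sum is always divisible by k!


def unique_count_pmf(n):
    """Exact P(K = k) for k = 1..n, where K is the number of distinct
    original items in a bootstrap sample of size n drawn from n items."""
    pmf = {}
    for k in range(1, n + 1):
        # n!/(n-k)! ordered choices of k distinct items, times S(n, k)
        # ways to split the n draws into k non-empty groups.
        ways = stirling2(n, k) * factorial(n) // factorial(n - k)
        pmf[k] = Fraction(ways, n ** n)
    return pmf


def simulate_unique_counts(n, trials, seed=0):
    """Draw `trials` bootstrap samples and count distinct items in each."""
    rng = random.Random(seed)
    return [len({rng.randrange(n) for _ in range(n)}) for _ in range(trials)]


if __name__ == "__main__":
    n = 50
    pmf = unique_count_pmf(n)
    exact_mean = float(sum(k * p for k, p in pmf.items()))
    counts = simulate_unique_counts(n, trials=2000)
    print("exact mean of K:    ", exact_mean)
    print("simulated mean of K:", sum(counts) / len(counts))
```

The exact mean agrees with the closed form n(1 - (1 - 1/n)^n), so the expected fraction of unique items approaches 1 - 1/e, roughly 0.632, as n grows. Exact rational arithmetic is used so that the probabilities sum to exactly 1; for large n a floating-point or log-space version would be more practical.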

An Adaptively Resized Parametric Bootstrap for Inference in High-dimensional Generalized Linear Models

It is demonstrated that the resized bootstrap method yields valid confidence intervals in both simulated and real data examples, and the methods extend to other high-dimensional generalized linear models.

On Uncertainty Estimation by Tree-based Surrogate Models in Sequential Model-based Optimization

A new way of constructing an ensemble of randomized trees is proposed, referred to as BwO forest, where bagging with oversampling is employed to construct bootstrapped samples that are used to build randomized trees with random splitting.

Regularization Guarantees Generalization in Bayesian Reinforcement Learning through Algorithmic Stability

This work shows that regularized MDPs satisfy a certain quadratic growth criterion, which is sufficient to establish stability, and allows us to study the effect of regularization on generalization in the Bayesian RL setting.

Sub-sampling for Efficient Non-Parametric Bandit Exploration

In this paper we propose the first multi-armed bandit algorithm based on re-sampling that achieves asymptotically optimal regret simultaneously for different families of arms (namely Bernoulli,

Inferring the Joint Demographic History of Multiple Populations: Beyond the Diffusion Approximation

A tractable model of ordinary differential equations for the evolution of allele frequencies that is closely related to the diffusion approximation but avoids many of its limitations and approximations is proposed.

Estimating helminth burdens using sibship reconstruction

This work developed a novel statistical method for estimating female worm burdens from data on the number of unique female parental genotypes derived from sibship reconstruction that represents a step towards a wider scope of application of parentage analysis techniques.

Reliable BIER With Peer Caching

Results indicate that local peer recovery is able to substantially reduce the overall retransmission traffic, and that this can be achieved through simple policies, where no signalling is required to build a set of candidate peers.

Bootstraps Regularize Singular Correlation Matrices

I show analytically that the average of $k$ bootstrapped correlation matrices rapidly becomes positive-definite as $k$ increases, which provides a simple approach to regularize singular Pearson



Analyzing Bagging

This work formalizes the notion of instability and derives theoretical results to analyze the variance reduction effect of bagging (or variants thereof) in mainly hard decision problems, which include estimation after testing in regression and decision trees for regression functions and classifiers.

Improvements on Cross-Validation: The 632+ Bootstrap Method

A particular bootstrap method, the .632+ rule, is shown to substantially outperform cross-validation in a catalog of 24 simulation experiments; estimating the variability of an error rate estimate is also considered.

Bias of the Random Forest Out-of-Bag (OOB) Error for Certain Input Parameters

Simulation studies are performed to compare the effect of the input parameters on the predictive ability of the random forest, and it is found that the number of variables sampled, m-try, has the largest impact on the true prediction error.

Bootstrap by Sequential Resampling.

Estimating classification error rate: Repeated cross-validation, repeated hold-out and bootstrap

Bagging predictors

Tests on real and simulated data sets using classification and regression trees and subset selection in linear regression show that bagging can give substantial gains in accuracy.


In bagging, predictors are constructed using bootstrap samples from the training set and then aggregated to form a bagged predictor. Each bootstrap sample leaves out about 37% of the examples. These

Random Forests

Internal estimates monitor error, strength, and correlation and these are used to show the response to increasing the number of features used in the forest, and are also applicable to regression.

On the Failure of the Bootstrap for Matching Estimators

Matching estimators are widely used in empirical economics for the evaluation of programs or treatments. Researchers using matching methods often apply the bootstrap to calculate the standard errors.