A scalable bootstrap for massive data

@article{Kleiner2011ASB,
  title={A scalable bootstrap for massive data},
  author={Ariel Kleiner and Ameet S. Talwalkar and Purnamrita Sarkar and Michael I. Jordan},
  journal={Journal of the Royal Statistical Society: Series B (Statistical Methodology)},
  year={2014},
  volume={76},
  number={4},
  pages={795--816}
}
The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large data sets—which are increasingly prevalent—the calculation of bootstrap‐based quantities can be prohibitively demanding computationally. Although variants such as subsampling and the m out of n bootstrap can be used in principle to reduce the cost of bootstrap computations, these methods are generally not robust to specification of tuning parameters (such as the number… 
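The computational burden the abstract describes is easy to see in the ordinary nonparametric bootstrap: every replicate resamples all n points with replacement and recomputes the estimator, so total cost grows with n times the number of replicates B. A minimal sketch (the data, estimator, and B = 200 are illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)

def bootstrap_se(data, estimator, B=200, rng=rng):
    """Ordinary nonparametric bootstrap standard error.

    Each replicate draws a full-size resample with replacement and
    re-applies the estimator, so the cost is O(n * B) estimator
    evaluations' worth of data touched -- prohibitive when n is massive.
    """
    n = len(data)
    stats = [estimator(data[rng.integers(0, n, size=n)]) for _ in range(B)]
    return float(np.std(stats, ddof=1))

se = bootstrap_se(data, np.mean)
# For the sample mean this should sit near the analytic value
# sigma / sqrt(n) = 2 / 100 = 0.02
```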

Citations

The Big Data Bootstrap
TLDR: The Bag of Little Bootstraps (BLB), a new procedure which incorporates features of both the bootstrap and subsampling to obtain a robust, computationally efficient means of assessing estimator quality, is presented.
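The BLB idea summarized above can be sketched briefly: draw s small subsets of b = n^0.6 distinct points, and within each subset draw multinomial weights summing to n, so the estimator is evaluated on full-size resamples that contain only b distinct values. This is a minimal illustration for the sample mean, with s, r, and the data invented for the example rather than taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(5.0, 2.0, size=10_000)

def blb_se(data, s=5, r=50, rng=rng):
    """Bag of Little Bootstraps sketch for the standard error of the mean.

    Each of s subsets holds only b = n**0.6 distinct points; each of r
    inner resamples is represented by multinomial counts summing to n,
    so no full-size dataset is ever materialized.
    """
    n = len(data)
    b = int(n ** 0.6)  # ~251 distinct points when n = 10_000
    subset_ses = []
    for _ in range(s):
        subset = rng.choice(data, size=b, replace=False)
        stats = []
        for _ in range(r):
            # Counts over the b distinct points; a weighted estimator
            # treats this as a resample of full size n.
            weights = rng.multinomial(n, np.full(b, 1.0 / b))
            stats.append(np.average(subset, weights=weights))
        subset_ses.append(np.std(stats, ddof=1))
    # Average the per-subset quality estimates across subsets.
    return float(np.mean(subset_ses))

se = blb_se(data)
```

The key design point is that a real estimator only needs to accept weighted data; here the weighted sample mean stands in for that interface.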
A Subsampled Double Bootstrap for Massive Data
TLDR: A new resampling method, the subsampled double bootstrap, is proposed, which is superior to BLB in terms of running time, greater sample coverage, and automatic implementation with fewer tuning parameters for a given time budget.
Fast and robust bootstrap in analysing large multivariate datasets
TLDR: The proposed bootstrap method facilitates using highly robust statistical methods in analyzing large-scale data sets, with significant savings in computation: the method does not recompute the estimator for each bootstrap sample but instead uses an analytic approximation.
Robust, Scalable, and Fast Bootstrap Method for Analyzing Large Scale Data
TLDR: This paper proposes a scalable, statistically robust and computationally efficient bootstrap method, compatible with distributed processing and storage systems, and demonstrates scalability, low complexity and robust statistical performance of the method in analyzing large data sets.
Scalable Statistical Inference Using Distributed Bootstrapping And Iterative ℓ1-Norm Minimization
TLDR: This paper proposes a scalable distributed bootstrap method that uses iterative estimation equations favoring sparse solutions and gives smaller root MSE and significantly lower bias than a bootstrap employing the widely used sparse estimator BPDN.
Hyperparameter Selection for Subsampling Bootstraps
TLDR: A hyperparameter selection methodology is developed, which can be used to select tuning parameters for subsampling methods, and finds an analytically simple and elegant relationship between the asymptotic efficiency of various subsampled estimators and their hyperparameters.
Sparsity-promoting bootstrap method for large-scale data
TLDR: A scalable nonparametric bootstrap method is proposed that operates with a smaller number of distinct data points on multiple disjoint subsets of data and is compatible with distributed storage systems and with distributed and parallel processing architectures.
A Cheap Bootstrap Method for Fast Inference
TLDR: This work presents a bootstrap methodology that uses minimal computation, namely a resampling effort as low as one Monte Carlo replication, while maintaining desirable statistical guarantees.
Variable Selection with Scalable Bootstrap in Generalized Linear Model for Massive Data
TLDR: This paper proposes Variable Selection with Bag of Little Bootstraps (BLBVS) for linear regression and extends it to generalized linear models, selecting important parameters and assessing estimator quality efficiently by combining the results of multiple bootstrap subsamples.

References

Showing 1–10 of 33 references
Richardson Extrapolation and the Bootstrap
Abstract: Simulation methods [particularly Efron's (1979) bootstrap] are being applied more and more frequently in statistical inference. Given data (X_1, …, X_n) distributed according to P, which …
More Efficient Bootstrap Computations
Abstract: This article concerns computational methods for the bootstrap that are more efficient than the straightforward Monte Carlo methods usually used. The bootstrap is considered in its simplest …
The Jackknife and the Bootstrap for General Stationary Observations
We extend the jackknife and the bootstrap method of estimating standard errors to the case where the observations form a general stationary sequence. We do not attempt a reduction to i.i.d. values.
The stationary bootstrap
Abstract: This article introduces a resampling procedure called the stationary bootstrap as a means of calculating standard errors of estimators and constructing confidence regions for parameters …
How Many Bootstraps
TLDR: This document proposes an adaptive sequential method that estimates the accuracy of the bootstrap based on the current bootstrap samples until the estimated accuracy is high enough.
ON THE CHOICE OF m IN THE m OUT OF n BOOTSTRAP AND CONFIDENCE BOUNDS FOR EXTREMA
For i.i.d. samples of size n, the ordinary bootstrap (Efron (1979)) is known to be consistent in many situations, but it may fail in important examples (Bickel, Götze and van Zwet (1997)). Using …
A note on methods of restoring consistency to the bootstrap
We consider the property of consistency and its relevance for determining the performance of the bootstrap. We analyse various parametric bootstrap approximations to the distributions of the Hodges …
Bootstrapping General Empirical Measures
It is proved that the bootstrapped central limit theorem for empirical processes indexed by a class of functions F and based on a probability measure P holds a.s. if and only if F ∈ CLT(P) and ∫ F² dP < ∞ …
Extrapolation and the bootstrap
The m out of n bootstrap, with or without replacement, where m → ∞ and m/n → 0, has been proposed on two grounds: (i) as a way of ensuring consistency when the classical bootstrap is not consistent; (ii) …
Computer Intensive Methods in Statistics
TLDR: Four topics treated in more detail were: Bayesian computing; interfacing statistics and computers; image analysis; resampling methods.