A scalable bootstrap for massive data

  title={A scalable bootstrap for massive data},
  author={Ariel Kleiner and Ameet S. Talwalkar and Purnamrita Sarkar and Michael I. Jordan},
  journal={Journal of the Royal Statistical Society: Series B (Statistical Methodology)},
The bootstrap provides a simple and powerful means of assessing the quality of estimators. However, in settings involving large data sets—which are increasingly prevalent—the calculation of bootstrap-based quantities can be prohibitively demanding computationally. Although variants such as subsampling and the m out of n bootstrap can be used in principle to reduce the cost of bootstrap computations, these methods are generally not robust to specification…
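As a concrete baseline for the cost the abstract refers to, the ordinary (Efron) bootstrap can be sketched as follows; the data, the estimator (here the sample mean) and the replicate count B are illustrative choices, and note that every replicate resamples all n points, which is what becomes prohibitive at scale.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=10_000)  # n = 10,000 toy points
n, B = data.size, 200

# B bootstrap replicates of the estimator; each one touches all n points.
reps = np.array([rng.choice(data, size=n, replace=True).mean()
                 for _ in range(B)])
boot_se = reps.std(ddof=1)                  # bootstrap standard-error estimate
theory_se = data.std(ddof=1) / np.sqrt(n)   # analytic SE of the mean, for comparison
```

For the mean the two estimates agree closely; the point of the sketch is the cost profile (B full-size resamples), which the methods surveyed below try to avoid.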


The Big Data Bootstrap
The Bag of Little Bootstraps (BLB), a new procedure which incorporates features of both the bootstrap and subsampling to obtain a robust, computationally efficient means of assessing estimator quality, is presented.
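A minimal sketch of the BLB idea described above, assuming the estimator is the sample mean and using illustrative values for the subset size b, the subset count s and the resamples-per-subset r; this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(size=10_000)
n = data.size
b = int(n ** 0.6)          # each little bootstrap sees only b distinct points
s, r = 10, 50              # s small subsets, r resamples per subset

subset_se = []
for _ in range(s):
    subset = rng.choice(data, size=b, replace=False)
    ests = []
    for _ in range(r):
        # Resample "n points" cheaply: multinomial counts over the b
        # distinct values, so each resample only touches b numbers.
        w = rng.multinomial(n, np.full(b, 1.0 / b))
        ests.append(np.dot(w, subset) / n)   # weighted mean of the subset
    subset_se.append(np.std(ests, ddof=1))   # quality measure within subset

blb_se = float(np.mean(subset_se))           # average quality across subsets
```

The key design point is that each resample is represented by counts over b values rather than n stored points, which is what makes the procedure friendly to distributed storage.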
Robust, Scalable, and Fast Bootstrap Method for Analyzing Large Scale Data
This paper proposes a scalable, statistically robust and computationally efficient bootstrap method that is compatible with distributed processing and storage systems, and demonstrates the scalability, low complexity and robust statistical performance of the method in analyzing large data sets.
A Subsampled Double Bootstrap for Massive Data
A new resampling method, the subsampled double bootstrap, is proposed, which is superior to BLB in terms of running time, sample coverage, and automatic implementation with fewer tuning parameters for a given time budget.
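One plausible reading of the subsampled double bootstrap can be sketched as follows: draw many small random subsets, take a single full-size weighted resample from each, and pool the resulting roots. The centring of each resample estimate at its own subset's estimate, and all parameter values, are assumptions of this sketch, not the paper's prescription.

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=10_000)
n = data.size
b = int(n ** 0.6)           # subset size
R = 300                     # number of subset/resample pairs

roots = []
for _ in range(R):
    subset = rng.choice(data, size=b, replace=False)
    w = rng.multinomial(n, np.full(b, 1.0 / b))     # one n-size resample
    est = np.dot(w, subset) / n                     # weighted mean
    roots.append(est - subset.mean())               # centre at subset estimate
sdb_se = float(np.std(roots, ddof=1))               # pooled variability estimate
```

Compared with the BLB sketch above, each subset contributes a single resample, so more subsets of the data are visited for the same total work.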
SFB 823 A subsampled double bootstrap for massive data
The bootstrap is a popular and powerful method for assessing precision of estimators and inferential methods. However, for massive datasets, which are increasingly prevalent, the bootstrap becomes…
Fast and robust bootstrap in analysing large multivariate datasets
The proposed bootstrap method facilitates the use of highly robust statistical methods in analyzing large-scale data sets, with significant savings in computation: the estimator is not recomputed for each bootstrap sample but is instead obtained analytically using a smart approximation.
Scalable Statistical Inference Using Distributed Bootstrapping And Iterative ℓ1-Norm Minimization
This paper proposes a scalable distributed bootstrap method that uses iterative estimation equations favoring sparse solutions, and gives smaller root MSE and significantly lower bias than a bootstrap employing the widely used sparse estimator BPDN.
Support for scalable analytics over databases and data-streams
This thesis provides an improved bootstrap approach that uses the Bag of Little Bootstraps along with other recent advances in bootstrap and time-series theory to provide an effective Hadoop-based implementation for assessing the quality of a time-series sample.
Hyperparameter Selection for Subsampling Bootstraps
A hyperparameter selection methodology is developed, which can be used to select tuning parameters for subsampling methods and finds an analytically simple and elegant relationship between the asymptotic efficiency of various subsampled estimators and their hyperparameters.
Sparsity-promoting bootstrap method for large-scale data
A scalable nonparametric bootstrap method is proposed that operates with a smaller number of distinct data points on multiple disjoint subsets of data and is compatible with distributed storage systems and distributed and parallel processing architectures.
A Bootstrap Metropolis–Hastings Algorithm for Bayesian Analysis of Big Data
The so-called bootstrap Metropolis–Hastings (BMH) algorithm is proposed, which provides a general framework for how to tame powerful MCMC methods to be used for big data analysis, that is, to replace the full data log-likelihood by a Monte Carlo average of the log-likelihoods that are calculated in parallel from multiple bootstrap samples.
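The acceptance step described above can be illustrated with a toy normal-mean model; the model, the subsample sizes, the n/m rescaling and all tuning constants here are assumptions of this sketch, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(loc=2.0, scale=1.0, size=50_000)
n, k, m = data.size, 5, 1_000
# k bootstrap subsamples drawn once, reusable in parallel at every step.
subs = [rng.choice(data, size=m, replace=True) for _ in range(k)]

def approx_loglik(theta):
    # Average of per-subsample log-likelihoods (normal, unit variance),
    # each rescaled by n/m to stand in for the full-data log-likelihood.
    return np.mean([(n / m) * -0.5 * np.sum((s - theta) ** 2) for s in subs])

theta, samples = 0.0, []
for _ in range(2_000):
    prop = theta + rng.normal(scale=0.02)        # random-walk proposal
    if np.log(rng.uniform()) < approx_loglik(prop) - approx_loglik(theta):
        theta = prop                              # Metropolis accept
    samples.append(theta)
post_mean = float(np.mean(samples[1_000:]))       # discard burn-in
```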


Richardson Extrapolation and the Bootstrap
Simulation methods [particularly Efron's (1979) bootstrap] are being applied more and more frequently in statistical inference. Given data (X1, …, Xn) distributed according to P, which…
For i.i.d. samples of size n, the ordinary bootstrap (Efron (1979)) is known to be consistent in many situations, but it may fail in important examples (Bickel, Götze and van Zwet (1997)). Using…
More Efficient Bootstrap Computations
This article concerns computational methods for the bootstrap that are more efficient than the straightforward Monte Carlo methods usually used. The bootstrap is considered in its simplest…
The Jackknife and the Bootstrap for General Stationary Observations
We extend the jackknife and the bootstrap method of estimating standard errors to the case where the observations form a general stationary sequence. We do not attempt a reduction to i.i.d. values.
The stationary bootstrap
This article introduces a resampling procedure called the stationary bootstrap as a means of calculating standard errors of estimators and constructing confidence regions for parameters…
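The block-resampling scheme named above can be sketched compactly: blocks start at uniformly random positions, have geometric lengths with mean 1/p, and wrap around circularly, so the resampled series is itself stationary. The toy series and the choice p = 0.1 (mean block length 10) are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
# Toy dependent series: slow drift plus noise.
x = rng.normal(size=500).cumsum() * 0.01 + rng.normal(size=500)
n, p = x.size, 0.1

def stationary_resample(x, p, rng):
    n = x.size
    out = np.empty(n)
    i = 0
    while i < n:
        start = rng.integers(n)              # uniform random block start
        length = rng.geometric(p)            # geometric block length, mean 1/p
        for j in range(min(length, n - i)):
            out[i + j] = x[(start + j) % n]  # circular wrap-around
        i += length
    return out

res = stationary_resample(x, p, rng)         # one stationary-bootstrap resample
```

Repeating this to obtain many resampled series, and recomputing the statistic on each, yields standard errors that respect the serial dependence of the data.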
Gap bootstrap methods for massive data sets with an application to transportation engineering
In this paper we describe two bootstrap methods for massive data sets. Naive applications of common resampling methodology are often impractical for massive data sets due to computational burden and…
Bootstrapping General Empirical Measures
It is proved that the bootstrapped central limit theorem for empirical processes indexed by a class of functions F and based on a probability measure P holds a.s. if and only if F ∈ CLT(P) and ∫ F dP…
How Many Bootstraps
This document proposes an adaptive sequential method that estimates the accuracy of the bootstrap from the current bootstrap samples and keeps drawing samples until the estimated accuracy is high enough.
An Introduction to the Bootstrap
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
Resilient Distributed Datasets, a distributed memory abstraction that lets programmers perform in-memory computations on large clusters in a fault-tolerant manner, is presented; it is implemented in a system called Spark, which is evaluated through a variety of user applications and benchmarks.