Corpus ID: 88522621

Partitioned Cross-Validation for Divide-and-Conquer Density Estimation

@article{Bhattacharya2016PartitionedCF,
  title={Partitioned Cross-Validation for Divide-and-Conquer Density Estimation},
  author={Anirban Bhattacharya and Jeffrey D. Hart},
  journal={arXiv: Methodology},
  year={2016}
}
We present an efficient method to estimate cross-validation bandwidth parameters for kernel density estimation in very large datasets where ordinary cross-validation is rendered highly inefficient, both statistically and computationally. Our approach relies on calculating multiple cross-validation bandwidths on partitions of the data, followed by suitable scaling and averaging to return a partitioned cross-validation bandwidth for the entire dataset. The partitioned cross-validation approach… 
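To make the idea in the abstract concrete, here is a minimal sketch (not the authors' implementation) of a partition-then-rescale-then-average bandwidth selector: least-squares cross-validation is run on each of m partitions, each bandwidth is rescaled by m^(-1/5), and the rescaled values are averaged. The m^(-1/5) factor assumes the standard h ∝ n^(-1/5) rate for a second-order Gaussian kernel, since the truncated abstract does not state the paper's exact scaling; the function names lscv_bandwidth and partitioned_cv_bandwidth are illustrative.

```python
import numpy as np

def lscv_bandwidth(x, grid=None):
    """Least-squares cross-validation bandwidth for a Gaussian-kernel KDE.

    Minimises LSCV(h) = int fhat_h^2 - (2/n) * sum_i fhat_{h,-i}(x_i)
    over a grid of candidate bandwidths (closed form for the Gaussian kernel).
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    d = x[:, None] - x[None, :]                      # pairwise differences
    if grid is None:                                 # heuristic grid around n^{-1/5}
        grid = x.std(ddof=1) * n ** (-0.2) * np.logspace(-1.5, 0.5, 40)
    scores = []
    for h in grid:
        # integral of fhat^2: convolution of two Gaussians gives an N(0, 2h^2) kernel
        term1 = np.exp(-d ** 2 / (4 * h ** 2)).sum() / (n ** 2 * 2 * h * np.sqrt(np.pi))
        # leave-one-out term: drop the diagonal (i == j) contributions
        k = np.exp(-d ** 2 / (2 * h ** 2)) / (h * np.sqrt(2 * np.pi))
        loo = (k.sum() - np.trace(k)) / (n * (n - 1))
        scores.append(term1 - 2 * loo)
    return grid[int(np.argmin(scores))]

def partitioned_cv_bandwidth(x, m, rng=None):
    """Split the data into m partitions, run LSCV on each, rescale each bandwidth
    to the full sample size (h scales as n^{-1/5} for a second-order kernel),
    and average the rescaled values."""
    rng = np.random.default_rng() if rng is None else rng
    parts = np.array_split(rng.permutation(np.asarray(x, dtype=float)), m)
    h_parts = np.array([lscv_bandwidth(p) for p in parts])
    return float((h_parts * m ** (-1 / 5)).mean())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.normal(size=10_000)
    print("partitioned CV bandwidth:", partitioned_cv_bandwidth(data, m=10, rng=rng))
```

Because each partition only holds n/m observations, the cross-validation step costs a fraction of the full-data computation and can be run on the partitions in parallel; the rescaling step is what makes the per-partition bandwidths comparable to a full-data choice.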
1 Citation
Bagging cross-validated bandwidths with application to big data
Hall & Robinson (2009) proposed and analysed the use of bagged cross-validation to choose the bandwidth of a kernel density estimator. They established that bagging greatly reduces the noise
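For comparison, a minimal sketch of the bagging idea this citing paper builds on: the LSCV bandwidth is computed on many small bootstrap subsamples, rescaled to the full sample size via the same n^(-1/5) rate, and averaged to damp the noise of plain cross-validation. It reuses the illustrative lscv_bandwidth helper from the sketch above and is not Hall & Robinson's exact procedure.

```python
import numpy as np

def bagged_cv_bandwidth(x, n_bags=50, subsample_size=500, rng=None):
    """Bagged cross-validation bandwidth (in the spirit of Hall & Robinson 2009):
    run LSCV on many small subsamples, rescale each bandwidth to the full sample
    size, and average."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    n = len(x)
    hs = []
    for _ in range(n_bags):
        sub = rng.choice(x, size=subsample_size, replace=True)   # bootstrap subsample
        hs.append(lscv_bandwidth(sub) * (subsample_size / n) ** (1 / 5))
    return float(np.mean(hs))
```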

References

SHOWING 1-10 OF 16 REFERENCES
Biased and Unbiased Cross-Validation in Density Estimation
Abstract: Nonparametric density estimation requires the specification of smoothing parameters. The demands of statistical objectivity make it highly desirable to base the choice on properties of the
Partitioned cross-validation
Partitioned cross-validation is proposed as a method for overcoming the large amounts of across sample variability to which ordinary cross-validation is subject. The price for cutting down on the
Extent to which least-squares cross-validation minimises integrated square error in nonparametric density estimation
Summary: Let h_0, ĥ_0 and ĥ_c be the windows which minimise mean integrated square error, integrated square error and the least-squares cross-validatory criterion, respectively, for kernel density
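For reference, the three criteria these windows minimise are, in standard notation (assumed here, not transcribed from the referenced paper), where f̂_h is the kernel estimator and f̂_{h,-i} its leave-one-out version:

```latex
% h_0 minimises MISE, \hat h_0 minimises ISE, and \hat h_c minimises LSCV.
\begin{align*}
  \mathrm{MISE}(h) &= \mathbb{E}\int \bigl(\hat f_h(x) - f(x)\bigr)^2 \, dx,\\
  \mathrm{ISE}(h)  &= \int \bigl(\hat f_h(x) - f(x)\bigr)^2 \, dx,\\
  \mathrm{LSCV}(h) &= \int \hat f_h(x)^2 \, dx - \frac{2}{n}\sum_{i=1}^{n} \hat f_{h,-i}(X_i).
\end{align*}
```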
Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates
TLDR
It is established that despite the computational speed-up, statistical optimality is retained: as long as m is not too large, the partition-based estimator achieves the statistical minimax rate over all estimators using the set of N samples.
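A minimal sketch of the divide-and-conquer kernel ridge regression scheme this TLDR describes: fit kernel ridge regression separately on m random partitions and average the per-partition predictions. The RBF kernel, regularisation level, and helper names are illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian (RBF) kernel matrix between the rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def fit_krr(X, y, lam=1e-3, gamma=1.0):
    """Kernel ridge regression on one partition: solve (K + n*lam*I) alpha = y."""
    K = rbf_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + len(X) * lam * np.eye(len(X)), y)
    return X, alpha

def dc_krr_predict(models, X_new, gamma=1.0):
    """Divide-and-conquer prediction: average the per-partition KRR predictions."""
    preds = [rbf_kernel(X_new, Xp, gamma) @ alpha for Xp, alpha in models]
    return np.mean(preds, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.uniform(-3, 3, size=(6000, 1))
    y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=6000)
    parts = np.array_split(rng.permutation(6000), 10)   # m = 10 random partitions
    models = [fit_krr(X[i], y[i]) for i in parts]
    X_test = np.linspace(-3, 3, 5)[:, None]
    print(dc_krr_predict(models, X_test))               # roughly sin(x) at the test points
```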
A PARTIALLY LINEAR FRAMEWORK FOR MASSIVE HETEROGENEOUS DATA.
TLDR
An aggregation type estimator for the commonality parameter is proposed that possesses the (non-asymptotic) minimax optimal bound and asymptotic distribution as if there were no heterogeneity.
Statistical inference in massive data sets
TLDR
The proposed approach significantly reduces the required amount of primary memory, and the resulting estimate will be as efficient as if the entire data set were analyzed simultaneously.
Approximations of Markov Chains and High-Dimensional Bayesian Inference
The Markov Chain Monte Carlo method is the dominant paradigm for posterior computation in Bayesian analysis. It has long been common to control computation time by making approximations to the Markov
On statistics, computation and scalability
TLDR
Some of the statistical consequences of computational perspectives on scalability, in particular divide-and-conquer methodology and hierarchies of convex relaxations, are investigated, with the goal of identifying “time-data tradeoffs”.
A scalable bootstrap for massive data
TLDR
The ‘bag of little bootstraps’ (BLB) is introduced, a new procedure that incorporates features of both the bootstrap and subsampling to yield a robust, computationally efficient means of assessing the quality of estimators.
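A minimal sketch of the BLB idea for a simple case (the standard error of a sample mean): each small subsample of size n^gamma is reweighted with multinomial counts that sum to the full sample size n, the quality measure is computed from the reweighted replicates, and the per-subsample estimates are averaged. The parameter choices (gamma, numbers of subsets and resamples) are illustrative.

```python
import numpy as np

def blb_stderr(x, gamma=0.7, n_subsets=20, n_resamples=100, rng=None):
    """Bag of little bootstraps, sketched for the standard error of the sample mean."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.asarray(x, dtype=float)
    n = len(x)
    b = int(n ** gamma)                                       # "little" subsample size
    per_subset = []
    for _ in range(n_subsets):
        sub = rng.choice(x, size=b, replace=False)            # one subsample without replacement
        # multinomial weights over the b points, summing to the full sample size n
        weights = rng.multinomial(n, np.full(b, 1.0 / b), size=n_resamples)
        replicate_means = weights @ sub / n                   # one weighted mean per replicate
        per_subset.append(np.std(replicate_means, ddof=1))    # per-subsample quality measure
    return float(np.mean(per_subset))                         # average across subsamples

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    data = rng.exponential(size=100_000)
    print("BLB standard error of the mean:", blb_stderr(data, rng=rng))
    print("analytic benchmark:", data.std(ddof=1) / np.sqrt(len(data)))
```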
Bayes and big data: the consensus Monte Carlo algorithm
TLDR
A useful definition of ‘big data’ is data that is too big to process comfortably on a single machine, either because of processor, memory, or disk bottlenecks, so there is a need for algorithms that perform distributed approximate Bayesian analyses with minimal communication.
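A minimal sketch of the consensus combination step, assuming the per-shard MCMC (with the prior raised to the power 1/S on each of the S shards) has already been run elsewhere: the shard draws are combined draw-by-draw with weights equal to the inverse of each shard's posterior sample covariance. The function name and the toy stand-in draws are illustrative.

```python
import numpy as np

def consensus_combine(shard_draws):
    """Consensus Monte Carlo combination step: given per-shard posterior draws of
    shape (S, T, d), weight each shard by the inverse of its sample covariance
    and average the S draws at each iteration t."""
    shard_draws = np.asarray(shard_draws, dtype=float)        # (shards, draws, params)
    S, T, d = shard_draws.shape
    weights = [np.linalg.inv(np.cov(draws, rowvar=False).reshape(d, d))
               for draws in shard_draws]                      # one (d, d) weight per shard
    total_inv = np.linalg.inv(np.sum(weights, axis=0))
    combined = np.empty((T, d))
    for t in range(T):
        weighted = np.sum([W @ shard_draws[s, t] for s, W in enumerate(weights)], axis=0)
        combined[t] = total_inv @ weighted
    return combined

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    # stand-in for per-shard MCMC output: 4 shards, 1000 draws each, 2 parameters
    draws = rng.normal(loc=[0.0, 1.0], scale=0.5, size=(4, 1000, 2))
    print(consensus_combine(draws).mean(axis=0))
```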