# Partitioned Cross-Validation for Divide-and-Conquer Density Estimation

@article{Bhattacharya2016PartitionedCF, title={Partitioned Cross-Validation for Divide-and-Conquer Density Estimation}, author={Anirban Bhattacharya and Jeffrey D. Hart}, journal={arXiv: Methodology}, year={2016} }

We present an efficient method to estimate cross-validation bandwidth parameters for kernel density estimation in very large datasets where ordinary cross-validation is rendered highly inefficient, both statistically and computationally. Our approach relies on calculating multiple cross-validation bandwidths on partitions of the data, followed by suitable scaling and averaging to return a partitioned cross-validation bandwidth for the entire dataset. The partitioned cross-validation approach…
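The abstract's recipe — compute a cross-validation bandwidth on each partition, then scale and average — can be sketched as follows. This is an illustrative implementation, not the authors' code: it uses least-squares cross-validation with a Gaussian kernel, a simple grid search, and the `m**(-1/5)` rescaling implied by the usual `n^(-1/5)` bandwidth rate for second-order kernels; the function names and tuning choices are assumptions.

```python
import numpy as np

def lscv_score(x, h):
    """Least-squares cross-validation criterion for a Gaussian-kernel
    density estimate with bandwidth h (O(n^2), so run it on partitions)."""
    n = len(x)
    d = (x[:, None] - x[None, :]) / h
    # Integral of the squared estimate: the Gaussian kernel convolved
    # with itself is a N(0, 2) density.
    term1 = np.exp(-d**2 / 4).sum() / (n * n * h * np.sqrt(4 * np.pi))
    # Leave-one-out term: sum the kernel over i != j (drop the diagonal,
    # where each entry equals 1/sqrt(2*pi)).
    k_sum = (np.exp(-d**2 / 2).sum() - n) / np.sqrt(2 * np.pi)
    term2 = 2 * k_sum / (n * (n - 1) * h)
    return term1 - term2

def partitioned_cv_bandwidth(x, m, grid, rng):
    """Split the data into m partitions, minimise LSCV on each over the
    bandwidth grid, then average and rescale by m**(-1/5) so the result
    targets the full sample size."""
    parts = np.array_split(rng.permutation(x), m)
    h_parts = [grid[np.argmin([lscv_score(p, h) for h in grid])]
               for p in parts]
    return np.mean(h_parts) * m ** (-1 / 5)

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
h = partitioned_cv_bandwidth(x, m=8, grid=np.linspace(0.05, 1.0, 40), rng=rng)
print(h)  # a bandwidth on the order of the n^(-1/5) reference rule
```

Each partition's LSCV search touches only `n/m` points, so the quadratic cost of cross-validation is paid on small blocks rather than the full dataset.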


## One Citation

Bagging cross-validated bandwidths with application to big data

- Mathematics
- 2020

Hall & Robinson (2009) proposed and analysed the use of bagged cross-validation to choose the bandwidth of a kernel density estimator. They established that bagging greatly reduces the noise…

## References

SHOWING 1-10 OF 16 REFERENCES

Biased and Unbiased Cross-Validation in Density Estimation

- Mathematics
- 1987

Nonparametric density estimation requires the specification of smoothing parameters. The demands of statistical objectivity make it highly desirable to base the choice on properties of the…

Partitioned cross-validation

- Economics
- 1987

Partitioned cross-validation is proposed as a method for overcoming the large amounts of across sample variability to which ordinary cross-validation is subject. The price for cutting down on the…

Extent to which least-squares cross-validation minimises integrated square error in nonparametric density estimation

- Mathematics
- 1987

Let h₀, ĥ₀ and ĥ_c be the windows which minimise mean integrated square error, integrated square error and the least-squares cross-validatory criterion, respectively, for kernel density…

Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates

- Mathematics, Computer Science
- J. Mach. Learn. Res.
- 2015

It is established that despite the computational speed-up, statistical optimality is retained: as long as m is not too large, the partition-based estimator achieves the statistical minimax rate over all estimators using the set of N samples.

A PARTIALLY LINEAR FRAMEWORK FOR MASSIVE HETEROGENEOUS DATA.

- Mathematics, Medicine
- Annals of Statistics
- 2016

An aggregation type estimator for the commonality parameter is proposed that possesses the (non-asymptotic) minimax optimal bound and asymptotic distribution as if there were no heterogeneity.

Statistical inference in massive data sets

- Computer Science
- 2012

The proposed approach significantly reduces the required amount of primary memory, and the resulting estimate is as efficient as if the entire data set had been analyzed simultaneously.

Approximations of Markov Chains and High-Dimensional Bayesian Inference

- Mathematics
- 2015

The Markov Chain Monte Carlo method is the dominant paradigm for posterior computation in Bayesian analysis. It has long been common to control computation time by making approximations to the Markov…

On statistics, computation and scalability

- Mathematics, Computer Science
- arXiv
- 2013

Some of the statistical consequences of computational perspectives on scalability, in particular divide-and-conquer methodology and hierarchies of convex relaxations, are investigated, with the goal of identifying "time-data tradeoffs".

A scalable bootstrap for massive data

- Mathematics, Computer Science
- 2011

The 'bag of little bootstraps' (BLB) is introduced: a new procedure that incorporates features of both the bootstrap and subsampling to yield a robust, computationally efficient means of assessing the quality of estimators.
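The BLB idea described above can be sketched for a concrete target, the standard error of the sample mean. This is a hedged illustration rather than the paper's reference implementation: the subset size `b = n**0.6` and the counts `s` and `r` are common illustrative choices, and the function name is invented here.

```python
import numpy as np

def blb_stderr(x, b, s, r, rng):
    """Bag-of-little-bootstraps sketch: estimate the standard error of the
    sample mean. For each of s small subsets of size b, draw r multinomial
    reweightings that emulate full-size (n-point) resamples, then average
    the per-subset standard-error estimates."""
    n = len(x)
    subset_ses = []
    for _ in range(s):
        sub = rng.choice(x, size=b, replace=False)
        # Multinomial counts sum to n: a size-n resample stored in b weights,
        # so each resample statistic costs O(b), not O(n).
        means = [np.average(sub, weights=rng.multinomial(n, np.full(b, 1 / b)))
                 for _ in range(r)]
        subset_ses.append(np.std(means, ddof=1))
    return np.mean(subset_ses)

rng = np.random.default_rng(1)
x = rng.normal(size=10_000)
se = blb_stderr(x, b=int(len(x) ** 0.6), s=5, r=50, rng=rng)
print(se)  # should be near the true standard error 1/sqrt(10_000) = 0.01
```

The key trick is that each "resample" is represented by multinomial weights on only `b` distinct points, which is what makes the procedure attractive when `n` is too large for ordinary bootstrapping.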

Bayes and big data: the consensus Monte Carlo algorithm

- Computer Science
- 2016

A useful definition of ‘big data’ is data that is too big to process comfortably on a single machine, either because of processor, memory, or disk bottlenecks, so there is a need for algorithms that perform distributed approximate Bayesian analyses with minimal communication.