Design of a next generation sampling service for large scale data analysis applications


Advances in data collection and storage technologies have resulted in large and dynamically growing data sets at many organizations. Database and data mining researchers often use sampling to great effect to scale up performance on these data sets at a small cost to accuracy. However, existing techniques often ignore the cost of computing a sample. This cost is often linear in the size of the data set rather than the size of the sample, which is expensive. Furthermore, for data mining applications that leverage progressive sampling or bootstrapping-based techniques, this cost can be prohibitive, since they require the generation of multiple samples.

To address this problem, we present a solution in the context of a state-of-the-art data analysis center. Specifically, we propose a scalable service that supports sample generation with cost linear in the size of the sample. We then present an efficient parallelization of this service. Our solution leverages high-speed interconnects (e.g., Myrinet, InfiniBand) for parallel I/O operations with pipelined data transfers. We export an interface that supports both ad-hoc SQL-like querying for database applications and a stand-alone service for data mining applications. We then evaluate our work using queries abstracted from a network monitoring and analysis application, which uses both database and progressive sampling queries. We demonstrate that our implementation achieves good load balance and realizes up to an order of magnitude speedup when compared with extant approaches.
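
To make the cost distinction concrete, the following is a minimal sketch of drawing a sample whose cost grows with the sample size rather than the data set size. It is not the service described in the paper: it assumes a hypothetical on-disk layout of fixed-size records, so that any record can be reached with a single seek, and the names `sample_indices`, `sample_records`, and `record_size` are illustrative only.

```python
import os
import random


def sample_indices(n_records, k, seed=None):
    """Pick k distinct record indices without touching the data itself."""
    rng = random.Random(seed)
    # random.sample over a range object does not materialize the range,
    # so the work here depends on k, not on n_records.
    return sorted(rng.sample(range(n_records), k))


def sample_records(path, record_size, k, seed=None):
    """Read k randomly chosen fixed-size records by seeking to each one.

    Assumption: the file is a flat sequence of records, each exactly
    record_size bytes long (hypothetical layout, not from the paper).
    """
    n_records = os.path.getsize(path) // record_size
    records = []
    with open(path, "rb") as f:
        # Sorted indices keep the seeks moving forward through the file.
        for i in sample_indices(n_records, k, seed):
            f.seek(i * record_size)
            records.append(f.read(record_size))
    return records
```

Under these assumptions, a call such as `sample_records("flows.bin", 64, 1000)` would issue about 1000 seeks and reads regardless of how large the file is, whereas a full scan (for example, reservoir sampling over a stream) reads every record once. Repeated calls, as progressive sampling or bootstrapping would need, therefore stay proportional to the total number of records sampled.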

DOI: 10.1145/1088149.1088162

Cite this paper

@inproceedings{Wang2005DesignOA, title={Design of a next generation sampling service for large scale data analysis applications}, author={Huai Wang and Srinivasan Parthasarathy and Amol Ghoting and Shirish Tatikonda and Gregory Buehrer and Tahsin M. Kurç and Joel H. Saltz}, booktitle={ICS}, year={2005} }