Logic-Partition Based Gaussian Sampling for Online Aggregation

Longbin Zhang, Yuxiang Wang, Xiaoliang Xu
2017 Fifth International Conference on Advanced Cloud and Big Data (CBD)
Online aggregation is a commonly used technology to return approximate query results over random samples, which provides a fast way for users to obtain a trade-off between time and accuracy. The key issue of online aggregation is how to guarantee the efficiency and effectiveness of random sample collection. However, the state-of-the-art approaches either adopt the random sampling method or adopt the sequential sampling with preprocessing to obtain the uniform samples. The former one suffers… 
1 Citation

Interactive Data Exploration of Distributed Raw Files: A Systematic Mapping Study
This paper reviews the current state of the art in interactive data exploration against three requirements: access to raw data files, storage in a distributed environment, and reasonably low latency.


Improving Online Aggregation Performance for Skewed Data Distribution
This paper presents a partition-based online aggregation system called POAS, which reduces the side effect of low selectivity by efficiently pruning unneeded data through its partition and shuffle strategies, and which approaches the appropriate sample proportion by drawing samples from relevant partitions with a dynamic sample size.
A Sampling-Based Hybrid Approximate Query Processing System in the Cloud
A hybrid approximate query processing model is proposed to improve overall OLA performance. A dynamic scheme-switching mechanism moves unpromising OLA queries to the bootstrap scheme for further processing, avoiding the full dataset scan that would otherwise result from an OLA estimation failure.
The analytical bootstrap: a new method for fast error estimation in approximate query processing
This paper introduces a probabilistic relational model for the bootstrap process, along with rigorous semantics and a unified error model, which bridges the gap between these two traditional approaches.
Distributed Online Aggregation
The results show that the DoA scheme significantly reduces the initial waiting time, progressively provides high-quality approximate answers with running confidence intervals, and adaptively grows the number of processing nodes as the sample size increases.
Wander Join: Online Aggregation via Random Walks
This paper proposes a new approach, the wander join algorithm, to the online aggregation problem by performing random walks over the underlying join graph, and designs an optimizer that chooses the optimal plan for conducting the random walks without having to collect any statistics a priori.
Continuous sampling for online aggregation over multiple queries
In COSMOS, a dataset is first scrambled so that sequentially scanning the dataset gives rise to a stream of random samples for all queries, which can potentially be used to compute the aggregates of descendent/dependent queries.
Efficient skew handling in online aggregation in the cloud
This paper proposes two methods to deal with two special types of data skew in online aggregation in the cloud and implements them in a cloud online aggregation system called COLA. The experimental results demonstrate that these methods can remarkably reduce the negative effect of data skew and yield better results.
OATS: online aggregation with two-level sharing strategy in cloud
This paper presents online aggregation with a two-level sharing strategy in the cloud (OATS), based on the MapReduce framework, to support online aggregation for large-scale concurrent query processing under skewed data distribution. It also proposes a heuristic algorithm that shares partial statistics calculations to decrease the number of final aggregation operations.
You can stop early with COLA: online processing of aggregate queries in the cloud
Cloud-based data management systems are emerging as scalable, fault-tolerant, and efficient solutions to manage large volumes of data with cost effective infrastructures, and more and more data
Large-sample and deterministic confidence intervals for online aggregation
  • P. Haas
  • Computer Science, Mathematics
    Proceedings. Ninth International Conference on Scientific and Statistical Database Management (Cat. No.97TB100150)
  • 1997
It is shown how new and existing central limit theorems, simple bounding arguments, and the delta method can be used to derive formulas for both large sample and deterministic confidence intervals, which contain the final query result with probability 1.
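
As a rough illustration of the large-sample (central limit theorem) intervals mentioned in the last entry, the following Python sketch maintains a running estimate of an average together with an approximate confidence interval as samples arrive in random order. It is not the method of any paper listed here. The function names `running_mean_ci` and `demo`, the synthetic uniform data, and the choice of Welford's update for the variance are all assumptions made for illustration, and the sketch ignores the finite-population correction.

```python
import math
import random
from statistics import NormalDist


def running_mean_ci(stream, confidence=0.95):
    """Yield (n, mean, half_width) after each sample, starting at n = 2.

    The half-width is the normal-approximation (CLT) interval
    z * sqrt(sample_variance / n). Hypothetical helper, for illustration only.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's algorithm)
    for x in stream:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if n < 2:
            continue  # the sample variance is undefined for one value
        variance = m2 / (n - 1)
        yield n, mean, z * math.sqrt(variance / n)


def demo(seed=0, population_size=100_000, sample_size=5_000):
    """Scan a shuffled synthetic table and return (true_mean, estimates)."""
    rng = random.Random(seed)
    population = [rng.uniform(0.0, 100.0) for _ in range(population_size)]
    rng.shuffle(population)  # a random scan order makes each prefix a random sample
    true_mean = sum(population) / population_size
    estimates = list(running_mean_ci(population[:sample_size]))
    return true_mean, estimates


if __name__ == "__main__":
    true_mean, estimates = demo()
    for n, mean, half_width in estimates[99::1000]:
        print(f"n={n:5d}  estimate={mean:7.3f} +/- {half_width:.3f}")
    print(f"exact answer: {true_mean:.3f}")
```

The interval narrows roughly as 1/sqrt(n), which is what lets a user stop the query once the reported accuracy is acceptable.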