Rapid Approximate Aggregation with Distribution-Sensitive Interval Guarantees

@article{Macke2021RapidAA,
  title={Rapid Approximate Aggregation with Distribution-Sensitive Interval Guarantees},
  author={Stephen Macke and Maryam Aliakbarpour and Ilias Diakonikolas and Aditya G. Parameswaran and Ronitt Rubinfeld},
  journal={2021 IEEE 37th International Conference on Data Engineering (ICDE)},
  year={2021},
  pages={1703-1714}
}
Aggregating data is fundamental to data analytics, data exploration, and OLAP. Approximate query processing (AQP) techniques are often used to accelerate computation of aggregates using samples, for which confidence intervals (CIs) are widely used to quantify the associated error. CIs used in practice fall into two categories: techniques that are tight but not correct, i.e., they yield tight intervals but only offer asymptotic guarantees, making them unreliable, or techniques that are correct but… 
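As a rough illustration of the two categories of CIs the abstract contrasts, the sketch below computes a CLT-based interval (asymptotic guarantee only) and a Hoeffding-based interval (finite-sample guarantee, assuming the values are known to lie in a fixed range) for a sample mean. This is not the paper's method; the function names `clt_interval` and `hoeffding_interval` and the value bounds `lo` and `hi` are assumptions made for illustration.

```python
import math
import random
import statistics


def clt_interval(sample, z=1.96):
    """Asymptotic CI for the mean based on the central limit theorem.

    Coverage is only approximate and can fail for small or skewed samples.
    z=1.96 corresponds to a nominal 95% interval.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    half_width = z * statistics.stdev(sample) / math.sqrt(n)
    return mean - half_width, mean + half_width


def hoeffding_interval(sample, lo, hi, delta=0.05):
    """Finite-sample CI for the mean from Hoeffding's inequality.

    Assumes independent draws that are known to lie in [lo, hi] (an
    assumption of this sketch). The interval contains the true mean with
    probability at least 1 - delta for any sample size, but it ignores
    the observed variance, so it is usually wider than the CLT interval.
    """
    n = len(sample)
    mean = statistics.fmean(sample)
    half_width = (hi - lo) * math.sqrt(math.log(2 / delta) / (2 * n))
    return mean - half_width, mean + half_width


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical data: 1,000 sampled values bounded in [0, 1].
    sample = [rng.random() ** 3 for _ in range(1000)]
    print("CLT:      ", clt_interval(sample))
    print("Hoeffding:", hoeffding_interval(sample, lo=0.0, hi=1.0))
```

The Hoeffding half-width depends only on the range and the sample size, which is why such an interval stays valid at any sample size but can be much wider than one that uses the observed spread of the data.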

Differentially Private Online Aggregation
TLDR
This work develops a family of differentially private mechanisms, including the optimal Gap mechanisms, for answering AVG, COUNT, and SUM queries with WHERE conditions. It also develops various optimizations to improve the accuracy of the Gap mechanisms and empirically confirms that the Gap mechanisms perform best overall.

References

Sample + Seek: Approximating Aggregates with Distribution Precision Guarantee
TLDR
A novel sampling scheme called measure-biased sampling is proposed to address the main challenges: providing rigorous error guarantees and handling arbitrary highly selective predicates without maintaining large-sized samples. Two new indexes to augment in-memory samples are also proposed.
Large-sample and deterministic confidence intervals for online aggregation
  • P. Haas
  • Computer Science, Mathematics
    Proceedings. Ninth International Conference on Scientific and Statistical Database Management (Cat. No.97TB100150)
  • 1997
TLDR
It is shown how new and existing central limit theorems, simple bounding arguments, and the delta method can be used to derive formulas for both large sample and deterministic confidence intervals, which contain the final query result with probability 1.
DAQ: A New Paradigm for Approximate Query Processing
TLDR
Deterministic approximate querying (DAQ) schemes are proposed, a closed deterministic approximation algebra is formalized, and some design principles for DAQ schemes are outlined. These schemes deliver speedups over exact aggregation and predicate evaluation, and outperform sampling-based schemes for extreme value aggregations.
The analytical bootstrap: a new method for fast error estimation in approximate query processing
TLDR
This paper introduces a probabilistic relational model for the bootstrap process, along with rigorous semantics and a unified error model, which bridges the gap between these two traditional approaches.
Approximate Query Processing: What is New and Where to Go?
TLDR
The survey can help practitioners understand existing AQP techniques and select appropriate methods for their applications, and it outlines research challenges and opportunities for AQP.
Relational confidence bounds are easy with the bootstrap
TLDR
This paper considers the problem of incorporating into a database system a powerful "plug-in" method for computing confidence bounds on the answer to relational database queries over sampled or incomplete data and argues that the algorithms presented should be incorporated into any database system which is intended to support analytic processing.
Ripple joins for online aggregation
TLDR
It is shown how ripple joins can be implemented in an existing DBMS using iterators, and an overview is given of the methods used to compute confidence intervals and to adaptively optimize the ripple join “aspect-ratio” parameters.
G-OLA: Generalized On-Line Aggregation for Interactive Analysis on Big Data
TLDR
G-OLA is a novel mini-batch execution model that generalizes OLA to support general OLAP queries with arbitrarily nested aggregates using efficient delta maintenance techniques. It is implemented in FluoDB, a parallel online query execution framework built on top of the Spark cluster computing framework that can scale to massive data sets.
Optimized stratified sampling for approximate query processing
TLDR
This work treats the problem as an optimization problem where, given a workload of queries, a stratified random sample of the original data is selected such that the error in answering the workload queries using the sample is minimized.
Hoeffding inequalities for join-selectivity estimation and online aggregation
TLDR
The new results can be used to modify the asymptotically based sampling procedures of Haas, Naughton, Seshadri, and Swami so that there is a guaranteed upper bound on the number of sampling steps. The conservative intervals developed for online aggregation avoid the large intermediate storage requirements and undercoverage problems of intervals based on large-sample theory.
...