Wander Join: Online Aggregation via Random Walks
@article{Li2016WanderJO, title={Wander Join: Online Aggregation via Random Walks}, author={Feifei Li and Bin Wu and Ke Yi and Zhuoyue Zhao}, journal={Proceedings of the 2016 International Conference on Management of Data}, year={2016} }
Joins are expensive, and online aggregation over joins was proposed to mitigate the cost, which offers users a nice and flexible tradeoff between query efficiency and accuracy in a continuous, online fashion. However, the state-of-the-art approach, in both internal and external memory, is based on ripple join, which is still very expensive and even needs unrealistic assumptions (e.g., tuples in a table are stored in random order). This paper proposes a new approach, the wander join algorithm…
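To make the random-walk idea concrete, here is a minimal sketch of a wander-join-style estimator for SUM over a three-table chain join. It is an illustration under stated assumptions, not the authors' implementation: the tables are in-memory lists of dicts, the column names `a`, `b`, `v` and the helper names are hypothetical, and confidence intervals and walk-plan optimization are left out. Each walk picks a uniform tuple from the first table, follows hash indexes to uniformly chosen matching tuples, and weights the aggregated value by the inverse of the path's sampling probability. A walk that reaches a dead end contributes zero.

```python
import random
from collections import defaultdict


def build_index(rows, key):
    """Hash index (assumed helper): join-key value -> rows with that value."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


def wander_join_sum(r1, r2_index, r3_index, n_walks, rng):
    """Estimate SUM(r3.v) over r1 JOIN r2 ON a JOIN r3 ON b via random walks."""
    total = 0.0
    for _ in range(n_walks):
        t1 = rng.choice(r1)                  # uniform start tuple in r1
        inv_prob = len(r1)                   # 1 / P(path so far)

        matches2 = r2_index.get(t1["a"], [])
        if not matches2:
            continue                         # dead end: walk contributes 0
        t2 = rng.choice(matches2)
        inv_prob *= len(matches2)

        matches3 = r3_index.get(t2["b"], [])
        if not matches3:
            continue
        t3 = rng.choice(matches3)
        inv_prob *= len(matches3)

        # Weight the value by the inverse sampling probability of this path.
        total += t3["v"] * inv_prob
    return total / n_walks                   # failed walks count in the mean


def exact_sum(r1, r2_index, r3_index):
    """Full join, used here only to compare against the estimate."""
    return sum(
        t3["v"]
        for t1 in r1
        for t2 in r2_index.get(t1["a"], [])
        for t3 in r3_index.get(t2["b"], [])
    )


if __name__ == "__main__":
    # Made-up data for illustration only.
    r1 = [{"a": i % 5} for i in range(100)]
    r2 = [{"a": i % 7, "b": i % 3} for i in range(200)]
    r3 = [{"b": i % 4, "v": float(i)} for i in range(50)]
    i2, i3 = build_index(r2, "a"), build_index(r3, "b")
    print("estimate:", wander_join_sum(r1, i2, i3, 10_000, random.Random(42)))
    print("exact:   ", exact_sum(r1, i2, i3))
```

Because each walk needs only index lookups, an estimate is available after any number of walks and can be refined continuously, which is what makes this style of sampling a fit for online aggregation.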
113 Citations
Wander Join and XDB: Online Aggregation via Random Walks
- Computer Science · SGMD
- 2017
This paper proposes a new approach, the wander join algorithm, to the online aggregation problem by performing random walks over the underlying join graph, and designs an optimizer that chooses the optimal plan for conducting the random walks without having to collect any statistics a priori.
Random Sampling over Joins Revisited
- Computer Science · SIGMOD Conference
- 2018
A general framework for random sampling over multi-way joins is proposed, which includes the algorithm of Chaudhuri et al. as a special case. Several ways to instantiate this framework are explored, depending on what prior information is available about the underlying data; they offer different tradeoffs between sample generation latency and throughput.
Random Sampling and Size Estimation Over Cyclic Joins
- Computer Science · ICDT
- 2020
This paper presents the first non-trivial result on sampling over cyclic joins, showing that after a linear-time preprocessing step, a join result can be drawn uniformly at random in expected time O(IN^ρ*/OUT), where IN^ρ* is known as the AGM bound of the join and OUT is its output size.
Weighted Random Sampling over Joins
- Computer Science · ArXiv
- 2022
This work presents the first approach for weighted random sampling from join results, and exhibits qualities that are urgently needed in practice, namely reduced memory footprint, streaming operation, support for selections, outer joins, semi joins and anti joins and unequal-probability sampling.
Exploration of Knowledge Graphs via Online Aggregation
- Computer Science
- 2022
An online aggregation algorithm specialized for exploration queries on knowledge graphs is devised. It leverages the low dimension of RDF graphs and the low selectivity of exploration queries by augmenting random walks with exact partial computations using a worst-case optimal join algorithm.
ApproxJoin: Approximate Distributed Joins
- Computer Science · SoCC
- 2018
ApproxJoin interweaves Bloom filter sketching and stratified sampling with the join computation in a new operator that preserves the statistical properties of an aggregation over the join output, achieving a speedup of up to 9x over unmodified Spark-based joins with the same sampling ratio.
PGMJoins: Random Join Sampling with Graphical Models
- Computer Science · SIGMOD Conference
- 2021
PGMJoins adapts Probabilistic Graphical Models to deriving provably random samples of the join result for (n-way) key joins, many-to-many joins, and cyclic and acyclic joins and contributes optimizations both for deriving the structure of the graph and for PGM inference.
Bandit join: preliminary results
- Computer Science · aiDM@SIGMOD
- 2020
In this approach, scan operators that precede a join learn, during query execution, which parts of the relations are more likely to join, and so produce more results faster with fewer I/O accesses.
Skew‐aware online aggregation over joins through guided sampling
- Computer Science · Concurr. Comput. Pract. Exp.
- 2018
Online aggregation is a query processing technique that returns approximate answers with error guarantees (in the form of confidence intervals) continuously during the query execution process. This…
Glue: Adaptively Merging Single Table Cardinality to Estimate Join Query Size
- Computer Science · ArXiv
- 2021
A very general framework, called Glue, is proposed to elegantly decouple the correlations across different tables and losslessly merge single-table CardEst results to estimate the join query size.
References
Ripple joins for online aggregation
- Computer Science · SIGMOD '99
- 1999
It is shown how ripple joins can be implemented in an existing DBMS using iterators, and an overview is given of the methods used to compute confidence intervals and to adaptively optimize the ripple join “aspect-ratio” parameters.
Continuous sampling for online aggregation over multiple queries
- Computer Science · SIGMOD Conference
- 2010
In COSMOS, a dataset is first scrambled so that sequentially scanning the dataset gives rise to a stream of random samples for all queries, which can potentially be used to compute the aggregates of descendent/dependent queries.
A scalable hash ripple join algorithm
- Computer Science · SIGMOD '02
- 2002
Results from a prototype implementation in a parallel DBMS show that the new hash ripple join algorithm combines parallelism with sampling to speed convergence, and that when allowed to run to completion, even in the presence of memory overflow, it is competitive with the traditional parallel hybrid hash join algorithm.
On random sampling over joins
- Computer Science · SIGMOD '99
- 1999
This paper gives a detailed study of the inefficiency of sampling the output of a query, based on new insights into the interaction between join and sampling, and develops join sampling techniques for the settings where negative results do not apply.
Distributed Online Aggregation
- Computer Science · Proc. VLDB Endow.
- 2009
The results show that the DoA scheme significantly reduces the initial waiting time and progressively provides high-quality approximate answers with running confidence intervals, and that the scheme adaptively grows the number of processing nodes as the sample size increases.
Large-sample and deterministic confidence intervals for online aggregation
- Computer Science, Mathematics · Proceedings. Ninth International Conference on Scientific and Statistical Database Management (Cat. No.97TB100150)
- 1997
It is shown how new and existing central limit theorems, simple bounding arguments, and the delta method can be used to derive formulas for both large sample and deterministic confidence intervals, which contain the final query result with probability 1.
G-OLA: Generalized On-Line Aggregation for Interactive Analysis on Big Data
- Computer Science · SIGMOD Conference
- 2015
G-OLA, a novel mini-batch execution model that generalizes OLA to support general OLAP queries with arbitrarily nested aggregates using efficient delta maintenance techniques, is implemented in FluoDB, a parallel online query execution framework built on top of the Spark cluster computing framework that can scale to massive data sets.
Online aggregation
- Computer Science · SIGMOD '97
- 1997
A new online aggregation interface is proposed that permits users to both observe the progress of their aggregation queries and control execution on the fly, and a suite of techniques that extend a database system to meet these requirements is presented.
Join Size Estimation Subject to Filter Conditions
- Computer Science, Mathematics · Proc. VLDB Endow.
- 2015
The proposed algorithm, Correlated Sampling, constructs a small space synopsis for each table, which can be used to provide a quick estimate of the join size of this table with other tables subject to dynamically specified predicate filter conditions.
ABS: a system for scalable approximate queries with accuracy guarantees
- Computer Science · SIGMOD Conference
- 2014
The recently introduced Analytical Bootstrap method combines the strengths of both approaches and provides the basis for the ABS system, which will be demonstrated at the conference along with its superior performance over the traditional approaches.