Wander Join: Online Aggregation via Random Walks

@inproceedings{Li2016WanderJO,
  title={Wander Join: Online Aggregation via Random Walks},
  author={Feifei Li and Bin Wu and Ke Yi and Zhuoyue Zhao},
  booktitle={Proceedings of the 2016 International Conference on Management of Data},
  year={2016}
}
  • Feifei Li, Bin Wu, Ke Yi, Zhuoyue Zhao
  • Published 14 June 2016
  • Computer Science
  • Proceedings of the 2016 International Conference on Management of Data
Joins are expensive, and online aggregation over joins was proposed to mitigate the cost, which offers users a nice and flexible tradeoff between query efficiency and accuracy in a continuous, online fashion. However, the state-of-the-art approach, in both internal and external memory, is based on ripple join, which is still very expensive and even needs unrealistic assumptions (e.g., tuples in a table are stored in random order). This paper proposes a new approach, the wander join algorithm… 
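The random-walk idea in the abstract can be illustrated with a minimal sketch. It assumes a three-table chain join with in-memory hash indexes on the join keys, and it weights each successful walk by the inverse of its sampling probability (a Horvitz-Thompson-style estimator); failed walks contribute zero. The table names (R1, R2, R3), column names (a, b, v), and function names are hypothetical and chosen only for illustration; this is not the authors' implementation and omits confidence intervals and plan optimization.

```python
import random
from collections import defaultdict


def build_index(rows, key):
    """Build a hash index mapping a join-key value to the rows that carry it."""
    index = defaultdict(list)
    for row in rows:
        index[row[key]].append(row)
    return index


def wander_join_sum(r1, r2_index, r3_index, walks, rng):
    """Estimate SUM(R3.v) over R1 JOIN R2 ON a JOIN R3 ON b using random walks."""
    total = 0.0
    for _ in range(walks):
        # Step 1: pick a starting tuple uniformly from R1.
        t1 = rng.choice(r1)
        prob = 1.0 / len(r1)
        # Step 2: move to a uniformly chosen joining tuple in R2.
        matches2 = r2_index.get(t1["a"], [])
        if not matches2:
            continue  # failed walk: contributes 0 to the estimate
        t2 = rng.choice(matches2)
        prob /= len(matches2)
        # Step 3: move to a uniformly chosen joining tuple in R3.
        matches3 = r3_index.get(t2["b"], [])
        if not matches3:
            continue  # failed walk: contributes 0 to the estimate
        t3 = rng.choice(matches3)
        prob /= len(matches3)
        # Weight the sampled value by the inverse probability of this path.
        total += t3["v"] / prob
    return total / walks


# Hypothetical toy data; the exact SUM(R3.v) over the join is 220.
R1 = [{"a": i % 3} for i in range(6)]
R2 = [{"a": 0, "b": 0}, {"a": 0, "b": 1}, {"a": 1, "b": 1}, {"a": 3, "b": 2}]
R3 = [{"b": 0, "v": 10.0}, {"b": 1, "v": 20.0}, {"b": 1, "v": 30.0}]

estimate = wander_join_sum(
    R1, build_index(R2, "a"), build_index(R3, "b"),
    walks=50_000, rng=random.Random(0),
)
print(f"estimated SUM: {estimate:.1f}")
```

Because every walk is independent, the running average can be reported at any time, and it tightens as more walks complete, which is what makes the approach suitable for online aggregation.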
Wander Join and XDB: Online Aggregation via Random Walks
TLDR
This paper proposes a new approach, the wander join algorithm, to the online aggregation problem by performing random walks over the underlying join graph, and designs an optimizer that chooses the optimal plan for conducting the random walks without having to collect any statistics a priori.
Random Sampling over Joins Revisited
TLDR
A general framework for random sampling over multi-way joins is proposed, which includes the algorithm of Chaudhuri et al. as a special case and several ways to instantiate this framework are explored, depending on what prior information is available about the underlying data, and offer different tradeoffs between sample generation latency and throughput.
Random Sampling and Size Estimation Over Cyclic Joins
TLDR
This paper presents the first non-trivial result on sampling over cyclic joins, showing that after a linear-time preprocessing step, a join result can be drawn uniformly at random in expected time O(IN/OUT), where IN is known as the AGM bound of the join and OUT is its output size.
Weighted Random Sampling over Joins
TLDR
This work presents the first approach for weighted random sampling from join results, and exhibits qualities that are urgently needed in practice, namely reduced memory footprint, streaming operation, support for selections, outer joins, semi joins and anti joins and unequal-probability sampling.
Exploration of Knowledge Graphs via Online Aggregation
TLDR
An algorithm for online aggregation that specializes in exploration queries on knowledge graphs is devised that leverages the low dimension of RDF graphs, and the low selectivity of exploration queries, by augmenting random walks with exact partial computations using a worst-case optimal join algorithm.
ApproxJoin: Approximate Distributed Joins
TLDR
ApproxJoin interweaves Bloom filter sketching and stratified sampling with the join computation in a new operator that preserves statistical properties of an aggregation over the join output, which achieves a speedup of up to 9x over unmodified Spark-based joins with the same sampling ratio.
PGMJoins: Random Join Sampling with Graphical Models
TLDR
PGMJoins adapts Probabilistic Graphical Models to deriving provably random samples of the join result for (n-way) key joins, many-to-many joins, and cyclic and acyclic joins and contributes optimizations both for deriving the structure of the graph and for PGM inference.
Bandit join: preliminary results
TLDR
In this approach, scan operators that precede a join learn, during query execution, which parts of the relations are more likely to join, and so produce more results faster with fewer I/O accesses.
Skew‐aware online aggregation over joins through guided sampling
Online aggregation is a query processing technique that returns approximate answers with error guarantees (in the form of confidence intervals) continuously during the query execution process. This
Glue: Adaptively Merging Single Table Cardinality to Estimate Join Query Size
TLDR
A very general framework, called Glue, is proposed to elegantly decouple the correlations across different tables and losslessly merge single-table CardEst results to estimate the join query size.

References

SHOWING 1-10 OF 58 REFERENCES
Ripple joins for online aggregation
TLDR
It is shown how ripple joins can be implemented in an existing DBMS using iterators, and an overview is given of the methods used to compute confidence intervals and to adaptively optimize the ripple join “aspect-ratio” parameters.
Continuous sampling for online aggregation over multiple queries
TLDR
In COSMOS, a dataset is first scrambled so that sequentially scanning the dataset gives rise to a stream of random samples for all queries, which can potentially be used to compute the aggregates of descendent/dependent queries.
A scalable hash ripple join algorithm
TLDR
Results from a prototype implementation in a parallel DBMS show that the new hash ripple join algorithm combines parallelism with sampling to speed convergence, and that when allowed to run to completion, even in the presence of memory overflow, it is competitive with the traditional parallel hybrid hash join algorithm.
On random sampling over joins
TLDR
A detailed study of the inefficiency of sampling the output of a query, based on new insights into the interaction between join and sampling, and develops join sampling techniques for the settings where negative results do not apply.
Distributed Online Aggregation
TLDR
The results show that the DoA scheme reduces the initial waiting time significantly and provides high quality approximate answers with running confidence intervals progressively and the scheme adaptively grows the number of processing nodes as the sample size increases.
Large-sample and deterministic confidence intervals for online aggregation
  • P. Haas
  • Computer Science, Mathematics
    Proceedings. Ninth International Conference on Scientific and Statistical Database Management (Cat. No.97TB100150)
  • 1997
TLDR
It is shown how new and existing central limit theorems, simple bounding arguments, and the delta method can be used to derive formulas for both large sample and deterministic confidence intervals, which contain the final query result with probability 1.
G-OLA: Generalized On-Line Aggregation for Interactive Analysis on Big Data
TLDR
G-OLA, a novel mini-batch execution model that generalizes OLA to support general OLAP queries with arbitrarily nested aggregates using efficient delta maintenance techniques, is implemented in FluoDB, a parallel online query execution framework built on top of the Spark cluster computing framework that can scale to massive data sets.
Online aggregation
TLDR
A new online aggregation interface is proposed that permits users to both observe the progress of their aggregation queries and control execution on the fly, and a suite of techniques that extend a database system to meet these requirements are presented.
Join Size Estimation Subject to Filter Conditions
TLDR
The proposed algorithm, Correlated Sampling, constructs a small space synopsis for each table, which can be used to provide a quick estimate of the join size of this table with other tables subject to dynamically specified predicate filter conditions.
ABS: a system for scalable approximate queries with accuracy guarantees
TLDR
The recently introduced Analytical Bootstrap method combines the strengths of both approaches and provides the basis for the ABS system, which will be demonstrated at the conference and its superior performance over the traditional approaches described above.