Skew strikes back: new developments in the theory of join algorithms

@article{Ngo2014SkewSB,
  title={Skew strikes back: new developments in the theory of join algorithms},
  author={Hung Quoc Ngo and Christopher R{\'e} and Atri Rudra},
  journal={SIGMOD Rec.},
  year={2014},
  volume={42},
  pages={5--16}
}
Evaluating the relational join is one of the central algorithmic and most well-studied problems in database systems. A staggering number of variants have been considered, including Block-Nested loop join, Hash-Join, Grace, and Sort-merge (see Graefe [17] for a survey, and [4, 7, 24] for discussions of more modern issues). Commercial database engines use finely tuned join heuristics that take into account a wide variety of factors including the selectivity of various predicates, memory, IO, etc. This…

Worst-Case Optimal Join Algorithms: Techniques, Results, and Open Problems
  • H. Ngo
  • Computer Science
    PODS
  • 2018
TLDR
This paper discusses the key techniques for proving runtime and output size bounds, and focuses on the fascinating connection between join algorithms and information theoretic inequalities, and the idea of how one can turn a proof into an algorithm.
Combining Worst-Case Optimal and Traditional Binary Join Processing
TLDR
This paper presents a comprehensive implementation approach for worst-case optimal joins that is practical within general-purpose relational database management systems supporting both hybrid transactional and analytical workloads and implements a hybrid query optimizer that intelligently and transparently combines both binary and multi-way joins within the same query plan.
Adopting worst-case optimal joins in relational database systems
TLDR
This paper presents a comprehensive implementation approach for worst-case optimal joins that is practical within general-purpose relational database management systems supporting both hybrid transactional and analytical workloads and implements a hybrid query optimizer that intelligently and transparently combines both binary and multi-way joins within the same query plan.
Join Processing for Graph Patterns: An Old Dog with New Tricks
TLDR
It is found that classical relational databases like Postgres and MonetDB, or newer graph databases/stores like Virtuoso and Neo4j, may be orders of magnitude slower than a fully featured RDBMS, LogicBlox, that uses these new ideas.
Fast Join Project Query Evaluation using Matrix Multiplication
TLDR
This paper studies how a class of join queries with projections can be evaluated faster using worst-case optimal algorithms together with matrix multiplication and indicates that matrix multiplication is a useful operation that can help speed up join processing owing to highly optimized open source libraries that are also highly parallelizable.
GHD-optimal join queries in compact space
  • Computer Science
  • 2021
TLDR
This thesis aims to significantly reduce the space required by WCO join algorithms in graph databases, allowing processes to be stored in memory higher up in the memory hierarchy and thus optimizing query times, and would also enable querying large datasets.
Worst-Case Optimal Radix Triejoin
TLDR
This paper presents a simple worst-case optimal multi-way join algorithm called the radix triejoin, which uses a binary encoding for reducing the domain of a database and generalises the core algorithm to conjunctive queries with inequality constraints and provides a new proof technique for the worst-case optimal join result.
Links between Join Processing and Convex Geometry
  • C. Ré
  • Computer Science
    ICDT
  • 2014
TLDR
This talk surveys some results on join processing that use inequalities from convex geometry, along with simplified algorithms and arguments for such worst-case optimal join algorithms, including the LeapFrog TrieJoin, a worst-case optimal join algorithm from LogicBlox.
Let's Rethink Join Optimization in Distributed Systems
TLDR
It is argued that there is a promising opportunity to implement and experiment with a new set of join algorithms in distributed systems that are ill-suited and suboptimal for more complex sparse data that many modern applications process.
Random Sampling over Joins Revisited
TLDR
A general framework for random sampling over multi-way joins is proposed, which includes the algorithm of Chaudhuri et al. as a special case and several ways to instantiate this framework are explored, depending on what prior information is available about the underlying data, and offer different tradeoffs between sample generation latency and throughput.

References

Size Bounds and Query Plans for Relational Joins
TLDR
This work studies relational joins from a theoretical perspective and shows that there exist queries for which the join-project plan suggested by the fractional edge cover approach may be substantially better than any join plan that does not use intermediate projections.
Leapfrog Triejoin: a worst-case optimal join algorithm
TLDR
This paper improves on the results of NPRR by proving that leapfrog triejoin achieves worst-case optimality for finer-grained classes of database instances, such as those defined by constraints on projection cardinalities, and shows that NPRR is not best-case optimal for such classes.
Design and evaluation of main memory hash join algorithms for multi-core CPUs
TLDR
A very simple hash join algorithm is very competitive to the other more complex methods, and improves dramatically as the skew in the input data increases, and it quickly starts to outperform all other algorithms.
Handling data skew in parallel joins in shared-nothing systems
TLDR
This work proposes a new join geography called PRPD (Partial Redistribution & Partial Duplication) to improve the performance and scalability of parallel joins in the presence of data skew in a shared-nothing system.
Sort vs. Hash Revisited: Fast Join Implementation on Modern Multi-Core CPUs
TLDR
This paper re-examines two popular join algorithms to determine if the latest computer architecture trends shift the tide that has favored hash join for many years and offers multicore implementations of hash join and sort-merge join which consistently outperform all previously reported results.
On the complexity of database queries (extended abstract)
TLDR
It is shown that, if the query size (or the number of variables in the query) is considered as a parameter, then the relational calculus and its fragments are classified at appropriate levels of the so-called W hierarchy of Downey and Fellows.
Practical Skew Handling in Parallel Joins
TLDR
This work developed, implemented, and experimented with four new skew-handling parallel join algorithms. One of them, called virtual processor range partitioning, was the clear winner in high skew cases, while traditional hybrid hash join was the clear winner in lower skew or no skew cases.
Conjunctive Query Containment Revisited
A Taxonomy and Performance Model of Data Skew Effects in Parallel Joins
TLDR
A taxonomy of skew effects is developed, and a new performance model is introduced that is used to compare the performance of two parallel join algorithms.
On the Complexity of Database Queries
TLDR
It is shown that, if the query size (or the number of variables in the query) is considered as a parameter, then the relational calculus and its fragments are classified at appropriate levels of the so-called W hierarchy of Downey and Fellows.