Tracking join and self-join sizes in limited storage
@inproceedings{Alon1999TrackingJA, title={Tracking join and self-join sizes in limited storage}, author={Noga Alon and Phillip B. Gibbons and Y. Matias and Mario Szegedy}, booktitle={PODS '99}, year={1999} }
Query optimizers rely on fast, high-quality estimates of result sizes in order to select between various join plans. Self-join sizes of relations provide bounds on the join size of any pair of such relations. The self-join size also indicates the degree of skew in the data, and has been advocated for several estimation procedures. Exact computation of the self-join size requires storage proportional to the number of distinct attribute values, which may be prohibitively large. In this paper, we study…
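As a rough illustration of the quantities the abstract mentions, the sketch below computes the exact self-join size as the sum of squared frequencies of the join attribute, and uses the Cauchy-Schwarz inequality to bound a join size by the two self-join sizes. The function names and toy relations are hypothetical and chosen only for this example; it is not the paper's limited-storage method, just the exact baseline, whose counter table grows with the number of distinct values.

```python
from collections import Counter
from math import sqrt


def self_join_size(values):
    """Exact self-join size: sum of squared frequencies of the join attribute."""
    # One counter per distinct value, so storage grows with the distinct count.
    freq = Counter(values)
    return sum(count * count for count in freq.values())


def join_size(left, right):
    """Exact equi-join size: sum over values of freq_left * freq_right."""
    freq_left, freq_right = Counter(left), Counter(right)
    # Counter returns 0 for values missing from the right-hand relation.
    return sum(count * freq_right[value] for value, count in freq_left.items())


def join_size_upper_bound(left, right):
    """Cauchy-Schwarz bound: |R join S| <= sqrt(SJ(R) * SJ(S))."""
    return sqrt(self_join_size(left) * self_join_size(right))


# Hypothetical toy relations, given as lists of join-attribute values.
r = ["a", "a", "a", "b"]  # frequencies: a=3, b=1
s = ["a", "b", "b", "c"]  # frequencies: a=1, b=2, c=1

print(self_join_size(r), self_join_size(s))  # self-join sizes of r and s
print(join_size(r, s), join_size_upper_bound(r, s))  # join size and its bound
```

The self-join size also reflects skew: a relation of n tuples with all values distinct has self-join size n, while one with a single repeated value has self-join size n squared.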
282 Citations
Similarity Join and Self-Join Size Estimation in a Streaming Environment
- Computer Science
- ArXiv
- 2018
We study the problem of similarity self-join and similarity join size estimation in a streaming setting where the goal is to estimate, in one scan of the input and with sublinear space in the input…
Memory-Efficient Key/Foreign-Key Join Size Estimation via Multiplicity and Intersection Size
- Computer Science
- 2021 IEEE 37th International Conference on Data Engineering (ICDE)
- 2021
This paper builds on a model by Allen Van Gelder, in which there is no notion of join selectivity, and presents both a data structure to approximate the number of distinct values in a join attribute after a filter operation, and formulas to estimate the factor by which a join size exceeds the intersection size.
The Sort-Merge-Shrink join
- Computer Science
- TODS
- 2006
The key innovation of the SMS join is that if the input data are clustered in a statistically random fashion on disk, then at all times, the join provides an online, statistical estimator for the eventual answer to the query as well as probabilistic confidence bounds.
Similarity Join and Similarity Self-Join Size Estimation in a Streaming Environment
- Computer Science
- IEEE Transactions on Knowledge and Data Engineering
- 2020
The results show that given the same space, the proposed algorithm has an order of magnitude less error for a large range of similarity thresholds and scales well with the input size.
Two-Level Sampling for Join Size Estimation
- Computer Science
- SIGMOD Conference
- 2017
This paper proposes a new sampling algorithm for join size estimation, called two-level sampling, which combines the advantages of three previous sampling methods while making further improvements.
Self-Join Size Estimation in Large-scale Distributed Data Systems
- Computer Science
- 2008 IEEE 24th International Conference on Data Engineering
- 2008
This work tackles the open problem of self-join size estimation in a large-scale distributed data system, where tuples of a relation are distributed over data nodes that form an overlay network, and develops analyses showing how Gini estimations can lead to estimations of the underlying Zipfian or power-law value distributions.
Join Size Estimation Subject to Filter Conditions
- Computer Science, Mathematics
- Proc. VLDB Endow.
- 2015
The proposed algorithm, Correlated Sampling, constructs a small space synopsis for each table, which can be used to provide a quick estimate of the join size of this table with other tables subject to dynamically specified predicate filter conditions.
Random Sampling over Joins Revisited
- Computer Science
- SIGMOD Conference
- 2018
A general framework for random sampling over multi-way joins is proposed, which includes the algorithm of Chaudhuri et al. as a special case. Several ways to instantiate this framework are explored, depending on what prior information is available about the underlying data; they offer different tradeoffs between sample generation latency and throughput.
New Estimation Algorithms for Streaming Data: Count-min Can Do More
- Computer Science
This paper proposes two new estimation algorithms for multiplicity queries and self-join size estimations, which significantly improve the estimation accuracies compared with the previous Count-min estimation algorithms when the data set is less skewed, exactly where the previous algorithms perform poorly.
Estimating Join Selectivities using Bandwidth-Optimized Kernel Density Models
- Computer Science
- Proc. VLDB Endow.
- 2017
This paper introduces a modern, self-tuning selectivity estimator for range scans based on KDE that outperforms state-of-the-art multidimensional histograms and is efficient to evaluate on graphics cards, and proposes two approaches to building a KDE model from a sample drawn from the join result.
References
Ripple joins for online aggregation
- Computer Science
- SIGMOD '99
- 1999
It is shown how ripple joins can be implemented in an existing DBMS using iterators, and an overview is given of the methods used to compute confidence intervals and to adaptively optimize the ripple join “aspect-ratio” parameters.
Join synopses for approximate query answering
- Computer Science
- SIGMOD '99
- 1999
This paper proposes join synopses as an effective solution for this problem and shows how precomputing just one join synopsis for each relation suffices to significantly improve the quality of approximate answers for arbitrary queries with foreign key joins.
Bifocal sampling for skew-resistant join size estimation
- Mathematics
- SIGMOD '96
- 1996
The estimate obtained by the bifocal sampling algorithm is proven to lie with high probability within a small constant factor of the actual join size, regardless of the skew, as long as the join size is Ω(n lg n), for relations consisting of n tuples.
Practical selectivity estimation through adaptive sampling
- Computer Science
- SIGMOD '90
- 1990
This paper extends the previous analysis to provide significantly improved bounds on the amount of sampling necessary for a given level of accuracy and provides “sanity bounds” to deal with queries for which the underlying data is extremely skewed or the query result is very small.
Balancing histogram optimality and practicality for query result size estimation
- Computer Science
- SIGMOD '95
- 1995
The overall conclusion is that the most effective approach is to focus on the class of histograms that accurately maintain the frequencies of a few attribute values and assume the uniform distribution for the rest, and choose for each relation the histogram in that class that is optimal for a self-join query.
Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports
- Computer Science
- VLDB
- 2001
This work presents an approach, called distinct sampling, that collects a specially tailored sample over the distinct values in the input, in a single scan of the data, and shows how it can provide fast, highly-accurate approximate answers for “report” queries in high-volume, session-based event recording environments, such as IP networks and customer service call centers.
Statistical estimators for relational algebra expressions
- Computer Science
- PODS '88
- 1988
This paper designs a sampling plan based on the cluster sampling method to improve the utilization of sampled data and to reduce the cost of sampling, and proposes consistent and unbiased estimators for arbitrary COUNT(E) type queries.
ICICLES: Self-Tuning Samples for Approximate Query Answering
- Computer Science
- VLDB
- 2000
This paper introduces icicles, a new class of samples that tune themselves to a dynamic workload and shows, analytically, that for a certain class of queries reflected by the workload, icicles yield more accurate answers.
Histogram-Based Estimation Techniques in Database Systems
- Computer Science
- 1997
This thesis identifies (theoretically and experimentally) the most accurate classes of histograms for estimating the sizes and distributions of the results of several important query operators and provides efficient (sampling-based) techniques to construct these histograms.
New sampling-based summary statistics for improving approximate query answers
- Computer Science
- SIGMOD '98
- 1998
This paper introduces two new sampling-based summary statistics, concise samples and counting samples, and presents new techniques for their fast incremental maintenance regardless of the data distribution, and considers their application to providing fast approximate answers to hot list queries.