Tracking join and self-join sizes in limited storage

@inproceedings{Alon1999TrackingJA,
  title={Tracking join and self-join sizes in limited storage},
  author={Noga Alon and Phillip B. Gibbons and Y. Matias and Mario Szegedy},
  booktitle={PODS '99},
  year={1999}
}
Query optimizers rely on fast, high-quality estimates of result sizes in order to select between various join plans. Self-join sizes of relations provide bounds on the join size of any pair of such relations. The self-join size also indicates the degree of skew in the data, and has been advocated for several estimation procedures. Exact computation of the self-join size requires storage proportional to the number of distinct attribute values, which may be prohibitively large. In this paper, we study… 
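As a small illustration of the quantities the abstract refers to, the following Python sketch computes the self-join size exactly (the sum of squared frequencies of the join attribute) and checks the standard Cauchy-Schwarz bound, under which the join size of two relations is at most the square root of the product of their self-join sizes. The function names and the toy data are hypothetical and chosen only for this example. This is the exact baseline that needs one counter per distinct value, not the limited-storage technique studied in the paper.

```python
from collections import Counter
from math import sqrt


def self_join_size(values):
    """Exact self-join size: sum over distinct values of frequency squared.

    Storage grows with the number of distinct values (one counter each).
    """
    freq = Counter(values)
    return sum(c * c for c in freq.values())


def join_size(left, right):
    """Exact equi-join size: sum over values of freq_left * freq_right."""
    freq_left, freq_right = Counter(left), Counter(right)
    return sum(c * freq_right[v] for v, c in freq_left.items())


# Hypothetical join-attribute columns of two small relations.
r = [1, 1, 2, 3, 3, 3]
s = [1, 3, 3, 4]

sj_r = self_join_size(r)    # 2^2 + 1^2 + 3^2 = 14
sj_s = self_join_size(s)    # 1^2 + 2^2 + 1^2 = 6
j_rs = join_size(r, s)      # 2*1 + 3*2 = 8
bound = sqrt(sj_r * sj_s)   # Cauchy-Schwarz upper bound on the join size

print(sj_r, sj_s, j_rs, j_rs <= bound)
```

A large self-join size relative to the relation size signals skew: six tuples with all-distinct values would give a self-join size of 6, whereas `r` above gives 14.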
Similarity Join and Self-Join Size Estimation in a Streaming Environment
We study the problem of similarity self-join and similarity join size estimation in a streaming setting where the goal is to estimate, in one scan of the input and with sublinear space in the input
Memory-Efficient Key/Foreign-Key Join Size Estimation via Multiplicity and Intersection Size
TLDR
This paper builds on a model by Allen Van Gelder, in which there is no notion of join selectivity, and presents both a data structure to approximate the number of distinct values in a join attribute after a filter operation, and formulas to estimate the factor by which a join size exceeds the intersection size.
The Sort-Merge-Shrink join
TLDR
The key innovation of the SMS join is that if the input data are clustered in a statistically random fashion on disk, then at all times, the join provides an online, statistical estimator for the eventual answer to the query as well as probabilistic confidence bounds.
Similarity Join and Similarity Self-Join Size Estimation in a Streaming Environment
TLDR
The results show that given the same space, the proposed algorithm has an order of magnitude less error for a large range of similarity thresholds and scales well with the input size.
Two-Level Sampling for Join Size Estimation
TLDR
This paper proposes a new sampling algorithm for join size estimation, called two-level sampling, which combines the advantages of three previous sampling methods while making further improvements.
Self-Join Size Estimation in Large-scale Distributed Data Systems
TLDR
This work tackles the open problem of self-join size estimation in a large-scale distributed data system, where tuples of a relation are distributed over data nodes which comprise an overlay network, and develops analyses showing how Gini estimations can lead to estimations of the underlying Zipfian or power-law value distributions.
Join Size Estimation Subject to Filter Conditions
TLDR
The proposed algorithm, Correlated Sampling, constructs a small space synopsis for each table, which can be used to provide a quick estimate of the join size of this table with other tables subject to dynamically specified predicate filter conditions.
Random Sampling over Joins Revisited
TLDR
A general framework for random sampling over multi-way joins is proposed, which includes the algorithm of Chaudhuri et al. as a special case. Several ways to instantiate this framework are explored, depending on what prior information is available about the underlying data; these offer different tradeoffs between sample generation latency and throughput.
New Estimation Algorithms for Streaming Data: Count-min Can Do More
TLDR
This paper proposes two new estimation algorithms for multiplicity queries and self-join size estimations, which significantly improve the estimation accuracies compared with the previous Count-min estimation algorithms when the data set is less skewed, exactly where the previous algorithms perform poorly.
Estimating Join Selectivities using Bandwidth-Optimized Kernel Density Models
TLDR
This paper introduces a modern, self-tuning selectivity estimator for range scans based on KDE that out-performs state-of-the-art multidimensional histograms and is efficient to evaluate on graphics cards and proposes two approaches to building a KDE model from a sample drawn from the join result.

References

Ripple joins for online aggregation
TLDR
It is shown how ripple joins can be implemented in an existing DBMS using iterators, and an overview is given of the methods used to compute confidence intervals and to adaptively optimize the ripple join “aspect-ratio” parameters.
Join synopses for approximate query answering
TLDR
This paper proposes join synopses as an effective solution for this problem and shows how precomputing just one join synopsis for each relation suffices to significantly improve the quality of approximate answers for arbitrary queries with foreign key joins.
Bifocal sampling for skew-resistant join size estimation
TLDR
The estimate obtained by the bifocal sampling algorithm is proven to lie with high probability within a small constant factor of the actual join size, regardless of the skew, as long as the join size is Ω(n lg n), for relations consisting of n tuples.
Practical selectivity estimation through adaptive sampling
TLDR
This paper extends the previous analysis to provide significantly improved bounds on the amount of sampling necessary for a given level of accuracy and provides “sanity bounds” to deal with queries for which the underlying data is extremely skewed or the query result is very small.
Balancing histogram optimality and practicality for query result size estimation
TLDR
The overall conclusion is that the most effective approach is to focus on the class of histograms that accurately maintain the frequencies of a few attribute values and assume the uniform distribution for the rest, and choose for each relation the histogram in that class that is optimal for a self-join query.
Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports
TLDR
This work presents an approach, called distinct sampling, that collects a specially tailored sample over the distinct values in the input, in a single scan of the data, and shows how it can provide fast, highly-accurate approximate answers for “report” queries in high-volume, session-based event recording environments, such as IP networks, customer service call centers, etc.
Statistical estimators for relational algebra expressions
TLDR
This paper designs a sampling plan based on the cluster sampling method to improve the utilization of sampled data and to reduce the cost of sampling, and proposes consistent and unbiased estimators for arbitrary COUNT(E) type queries.
ICICLES: Self-Tuning Samples for Approximate Query Answering
TLDR
This paper introduces icicles, a new class of samples that tune themselves to a dynamic workload and shows, analytically, that for a certain class of queries reflected by the workload, icicles yield more accurate answers.
Histogram-Based Estimation Techniques in Database Systems
TLDR
This thesis identifies (theoretically and experimentally) the most accurate classes of histograms for estimating the sizes and distributions of the results of several important query operators and provides efficient (sampling-based) techniques to construct these histograms.
New sampling-based summary statistics for improving approximate query answers
TLDR
This paper introduces two new sampling-based summary statistics, concise samples and counting samples, and presents new techniques for their fast incremental maintenance regardless of the data distribution, and considers their application to providing fast approximate answers to hot list queries.