Counting Distinct Elements in a Data Stream

@inproceedings{BarYossef2002CountingDE,
  title={Counting Distinct Elements in a Data Stream},
  author={Ziv Bar-Yossef and T. S. Jayram and Ravi Kumar and D. Sivakumar and Luca Trevisan},
  booktitle={RANDOM},
  year={2002}
}
We present three algorithms to count the number of distinct elements in a data stream to within a factor of 1 ± ε. Our algorithms improve upon known algorithms for this problem, and offer a spectrum of time/space tradeoffs.
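For orientation, here is a minimal sketch of one well-known way to approximate a distinct count in small space, the k-minimum-values idea: hash every element to a number in [0, 1), remember only the k smallest distinct hash values, and estimate the count from the k-th smallest. The abstract does not spell out the paper's three algorithms, so this should be read as an illustration of the problem rather than a reproduction of them. The names `estimate_distinct` and `_unit_hash`, the default `k`, and the use of SHA-256 as the hash function are all assumptions made for this example.

```python
import hashlib
import heapq


def _unit_hash(item: str) -> float:
    """Map an item to a pseudo-uniform value in [0, 1) (assumed hash: SHA-256)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def estimate_distinct(stream, k: int = 256) -> float:
    """Estimate the number of distinct items using the k smallest hash values."""
    heap = []        # max-heap (negated values) holding the k smallest hashes
    members = set()  # hash values currently in the heap, to skip duplicates
    for item in stream:
        h = _unit_hash(str(item))
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # Insert the new hash and evict the current largest one.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        # Fewer than k distinct hashes were seen, so the count is exact.
        return float(len(heap))
    # The k-th smallest of n uniform values sits near k/n, so n is about (k-1)/v_k.
    return (k - 1) / (-heap[0])
```

Memory use depends on `k`, not on the stream length, and the relative error shrinks roughly like 1/sqrt(k), so a larger `k` trades space for accuracy. Calling `estimate_distinct(["a", "b", "a", "c"], k=16)` sees fewer than 16 distinct values and therefore returns the exact count.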
Counting distinct items over update streams
Sketching and streaming algorithms for processing massive data
TLDR
Techniques known as sketching and streaming for processing massive data both quickly and memory-efficiently are explored in this article.
Streaming Algorithms for Data in Motion
TLDR
Two new data stream models are proposed: the reset model and the delta model, motivated by applications to databases, and to tracking the location of spatial points, for tracking the "extent" of the points and Lp sampling.
Distinct-Values Estimation over Data Streams
TLDR
This work considers the problem of estimating the number of distinct values in a data stream with repeated values, improving the accuracy guarantees on the estimation, proving lower bounds, and considering other settings such as sliding windows, distributed streams, and sensor networks.
Aggregate Computation over Data Streams
TLDR
This paper surveys three important kinds of aggregate computation over data streams: frequency moments, frequency counts, and order statistics.
Range-Efficient Counting of Distinct Elements in a Massive Data Stream
TLDR
A randomized algorithm which yields an (ε, δ)-approximation of F_0, the number of distinct elements in a data stream where each element of the stream is not just a single integer but an interval of integers.
Streaming Algorithms for Robust Distinct Elements
TLDR
This paper formalizes the problem of robust distinct elements, and develops space and time-efficient streaming algorithms for datasets in the Euclidean space, using a novel technique the authors call bucket sampling, and extends the algorithmic framework to other metric spaces by establishing a connection between bucket sampling and the theory of locality sensitive hashing.
Model Counting Meets Distinct Elements in a Data Stream
TLDR
This work seeks to investigate whether bridging the seeming communication gap between the two communities may pave the way to richer fundamental insights in constraint satisfaction problems and data stream models.
Data Streams as Random Permutations: the Distinct Element Problem
TLDR
It is shown that data streams can sometimes usefully be studied as random permutations, and this is illustrated by introducing RECORDINALITY, an algorithm which estimates the number of distinct elements in a stream by counting the number of $k$-records occurring in it.
Distinct Sampling on Streaming Data with Near-Duplicates
TLDR
This paper studies how to perform distinct sampling in the streaming model where data contain near-duplicates, and presents algorithms with provable theoretical guarantees for datasets in the Euclidean space.
...
...

References

Probabilistic Counting Algorithms for Data Base Applications
Estimating simple functions on the union of data streams
TLDR
The distributed streams model is related to previously studied non-distributed (i.e., merged) streams models, presenting tight bounds on the gap between the distributed and merged models for deterministic algorithms, and employs a novel coordinated sampling technique to extract a sample of the union.
Reductions in streaming algorithms, with an application to counting triangles in graphs
TLDR
This work designs the first algorithm for the number of distinct elements in a data stream that achieves arbitrary approximation factors and develops the concept of list-efficient streaming algorithms that are essential to the design of efficient streaming algorithms through reductions.
A linear-time probabilistic counting algorithm for database applications
TLDR
A probabilistic algorithm for counting the number of unique values in the presence of duplicates, which has O(q) time complexity, and produces an estimation with an arbitrary accuracy prespecified by the user using only a small amount of space is presented.
The space complexity of approximating the frequency moments
TLDR
It turns out that the numbers F_0, F_1 and F_2 can be approximated in logarithmic space, whereas the approximation of F_k for k ≥ 6 requires n^{Ω(1)} space.
Universal Classes of Hash Functions
Universal classes of hash functions (Extended Abstract)
TLDR
An input independent average linear time algorithm for storage and retrieval on keys that makes a random choice of hash function from a suitable class of hash functions.
Selectivity and Cost Estimation for Joins Based on Random Sampling
TLDR
A partial ordering that compares the variability of the estimators for the different procedures after an arbitrary fixed number of sampling steps and implies a partial ordering of the corresponding fixed-precision procedures with respect to sampling cost.
New classes and applications of hash functions
  • M. Wegman, L. Carter
  • Computer Science, Mathematics
    20th Annual Symposium on Foundations of Computer Science (sfcs 1979)
  • 1979
TLDR
Several new classes of hash functions with certain desirable properties are exhibited, and two novel applications for hashing which make use of these functions are introduced, including a provably secure authentication technique for sending messages over insecure lines.
New Hash Functions and Their Use in Authentication and Set Equality
...
...