An optimal algorithm for the distinct elements problem

@inproceedings{Kane2010AnOA,
  title={An optimal algorithm for the distinct elements problem},
  author={Daniel M. Kane and Jelani Nelson and David P. Woodruff},
  booktitle={PODS '10},
  year={2010}
}
We give the first optimal algorithm for estimating the number of distinct elements in a data stream, closing a long line of theoretical research on this problem begun by Flajolet and Martin in their seminal paper in FOCS 1983. [...]

Key Method: This probability can be amplified by independent repetition.
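The remark about amplification by independent repetition can be illustrated with the standard median trick: run several independent copies of a constant-probability estimator and report the median. The sketch below is only an illustration under stated assumptions. It uses a simple k-minimum-values estimator as a stand-in, not the algorithm of this paper, and the names `_hash01`, `amplified_estimate`, `k`, and `repetitions` are hypothetical.

```python
import hashlib
import statistics


def _hash01(item, seed):
    """Map (seed, item) to a pseudo-random float in [0, 1).

    Assumption: a seeded cryptographic hash stands in for the
    limited-independence hash families used in the literature.
    """
    data = f"{seed}:{item}".encode()
    digest = hashlib.blake2b(data, digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64


def amplified_estimate(stream, k=256, repetitions=9):
    """Median of independent k-minimum-values estimates, in one pass."""
    # One sketch per repetition; each keeps the k smallest hash values.
    sketches = [set() for _ in range(repetitions)]
    for item in stream:
        for seed, sketch in enumerate(sketches):
            sketch.add(_hash01(item, seed))  # duplicates hash identically
            if len(sketch) > k:
                sketch.remove(max(sketch))
    estimates = []
    for sketch in sketches:
        if len(sketch) < k:
            # Fewer than k distinct items were seen: the count is exact.
            estimates.append(float(len(sketch)))
        else:
            # The k-th smallest hash value v suggests about (k - 1) / v items.
            estimates.append((k - 1) / max(sketch))
    # The median is far off only if about half of the copies are far off.
    return statistics.median(estimates)
```

Each copy on its own is accurate only with constant probability; taking the median of O(log(1/δ)) independent copies pushes the failure probability below δ, at the cost of a corresponding multiplicative increase in space.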


Optimal streaming and tracking distinct elements with high probability
TLDR
This work provides an optimal algorithm using O(lg δ^{-1} / ε^2 + lg n) bits of space, matching known lower bounds for this problem, and settles completely the space complexity of the distinct elements problem with respect to all standard parameters.
Optimal Streaming and Tracking Distinct Elements with High Probability
TLDR
This work provides an optimal algorithm using O(lg δ^{-1} / ε^2 + lg n) bits of space, matching known lower bounds for this problem, and settles completely the space complexity of the distinct elements problem with respect to all standard parameters.
Tight bounds for data stream algorithms and communication problems
TLDR
This thesis gives efficient algorithms and near-tight lower bounds for the following problems in the streaming model: finding duplicates in data streams, improving the algorithms of Gopalan and Radhakrishnan from SODA’09, and an O(log n log m) space algorithm that works entirely in the streaming model.
An Improved Interactive Streaming Algorithm for the Distinct Elements Problem
TLDR
The exact computation of the number of distinct elements (frequency moment F_0) is a fundamental problem in the study of data streaming algorithms and a model where the data stream is also processed by a powerful helper, who provides an interactive proof of the result.
An Asymptotically Optimal Algorithm for Maximum Matching in Dynamic Streams
We present an algorithm for the maximum matching problem in dynamic (insertion-deletions) streams with asymptotically optimal space: for any n-vertex graph, our algorithm with high probability
Streaming Algorithms for Robust Distinct Elements
TLDR
This paper formalizes the problem of robust distinct elements, and develops space and time-efficient streaming algorithms for datasets in the Euclidean space, using a novel technique the authors call bucket sampling, and extends the algorithmic framework to other metric spaces by establishing a connection between bucket sampling and the theory of locality sensitive hashing.
An Optimal Lower Bound for Distinct Elements in the Message Passing Model
TLDR
This work considers the setting in which each player holds a subset S_i of elements of a universe of size n, and their goal is to output a (1 + ε)-approximation to the total number of distinct elements in the union of the sets S_i with constant probability, which can be amplified by independent repetition.
Better Streaming Algorithms for the Maximum Coverage Problem
TLDR
The main goal of this work is to design algorithms, with approximation guarantees as close as possible to 1 − 1/e, that use sublinear space o(mn), and to study the maximum k-vertex coverage problem in the dynamic graph stream model.
Optimal streaming and tracking distinct elements with high probability
TLDR
This work provides an algorithm using $\mathcal{O}(\frac{\log \delta^{-1}}{\varepsilon^2} + \log n)$ bits of space, which is shown to be optimal.
An Optimal Algorithm for Large Frequency Moments Using O(n^(1-2/k)) Bits
TLDR
This paper provides an upper bound on the space required to find a k-th frequency moment of O(n^(1-2/k)) bits that matches, up to a constant factor, the lower bound of Woodruff et.

References

SHOWING 1-10 OF 50 REFERENCES
Algorithms for dynamic geometric problems over data streams
TLDR
This paper presents low-storage data structures that maintain approximate solutions to geometric problems, under insertions and deletions of points (this is called a turnstile model in [24]).
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
TLDR
This extended abstract describes and analyses a near-optimal probabilistic algorithm, HYPERLOGLOG, dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles, and makes it possible to estimate cardinalities well beyond $10^9$ with a typical accuracy of 2% while using a memory of only 1.5 kilobytes.
Periodicity in Streams
TLDR
A 1-pass randomized streaming algorithm that uses O(log^2 n) space and reports the shortest period if the given stream is periodic, and a randomized streaming algorithm with approximation factor 2 + ε that takes O(1/ε^2) space.
A Multi-Round Communication Lower Bound for Gap Hamming and Some Consequences
TLDR
It is concluded, for instance, that $\epsilon$-approximately counting the number of distinct elements in a data stream requires $\Omega(1/\epsilon^2)$ space, even with multiple (a constant number of) passes over the input stream.
Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies
TLDR
Three strategies for estimating the storage blowup that will result from a proposed set of precomputations without actually computing them are proposed: one based on sampling, one based on mathematical approximation, and one based upon probabilistic counting.
Bitmap Algorithms for Counting Active Flows on High-Speed Links
TLDR
A family of bitmap algorithms that address the problem of counting the number of distinct header patterns (flows) seen on a high-speed link and can be used to detect DoS attacks and port scans and to solve measurement problems.
Uniform Hashing in Constant Time and Optimal Space
TLDR
This paper presents an almost ideal solution to this problem: a hash function $h: U \to V$ that, on any set of $n$ inputs, behaves like a truly random function with high probability, can be evaluated in constant time on a RAM and can be stored in $(1+\epsilon)n\log |V| + O(n+\log \log |U|)$ bits.
Loglog Counting of Large Cardinalities (Extended Abstract)
TLDR
The LogLog algorithm makes use of m "small bytes" of auxiliary memory in order to estimate in a single pass the number of distinct elements (the "cardinality") in a file, and it does so with an accuracy that is of the order of 1/√m.
Estimating simple functions on the union of data streams
TLDR
The distributed streams model is related to previously studied non-distributed (i.e., merged) streams models, presenting tight bounds on the gap between the distributed and merged models for deterministic algorithms, and employs a novel coordinated sampling technique to extract a sample of the union.
Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports
TLDR
This work presents an approach, called distinct sampling, that collects a specially tailored sample over the distinct values in the input, in a single scan of the data, and shows how it can provide fast, highly-accurate approximate answers for “report” queries in high-volume, session-based event recording environments, such as IP networks, customer service call centers, etc.