# An optimal algorithm for the distinct elements problem

@inproceedings{Kane2010AnOA, title={An optimal algorithm for the distinct elements problem}, author={Daniel M. Kane and Jelani Nelson and David P. Woodruff}, booktitle={PODS '10}, year={2010} }

We give the first optimal algorithm for estimating the number of distinct elements in a data stream, closing a long line of theoretical research on this problem begun by Flajolet and Martin in their seminal paper in FOCS 1983. [...] This probability can be amplified by independent repetition.
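The remark about amplifying a success probability by independent repetition can be illustrated with the standard median trick. The sketch below is not the algorithm from the paper: it swaps in a simple k-minimum-values (KMV) estimator as the base estimator, and the names `_hash01`, `kmv_estimate`, and `amplified_estimate` as well as the defaults `k=256` and `repetitions=9` are hypothetical choices made only for illustration.

```python
import hashlib
import heapq
import statistics


def _hash01(item, seed):
    """Map (seed, item) to a pseudo-random float in [0, 1)."""
    digest = hashlib.blake2b(
        repr(item).encode(),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little") / 2**64


def kmv_estimate(stream, k, seed):
    """Base estimator (assumption: KMV, not the paper's method).

    Keeps the k smallest distinct hash values and estimates the number
    of distinct elements as (k - 1) / (k-th smallest hash value).
    """
    heap = []        # negated values, so heap[0] is the largest kept hash
    members = set()  # hash values currently kept, to skip repeats
    for item in stream:
        h = _hash01(item, seed)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        # Fewer than k distinct hash values seen: the count is exact.
        return float(len(heap))
    return (k - 1) / (-heap[0])


def amplified_estimate(stream, k=256, repetitions=9):
    """Median of independent repetitions of the base estimator.

    If each repetition is accurate with probability above 1/2, the
    median is inaccurate only when at least half of the repetitions
    fail, which becomes exponentially unlikely as repetitions grow.
    """
    # Materialised only to keep the sketch short; a one-pass version
    # would feed every item to all repetitions as it arrives.
    items = list(stream)
    return statistics.median(
        kmv_estimate(items, k, seed) for seed in range(repetitions)
    )


if __name__ == "__main__":
    data = [i % 5000 for i in range(20000)]  # 5000 distinct values
    print(amplified_estimate(data))
```

Each repetition uses a different hash seed, so the repetitions behave as independent trials; the cost is a multiplicative factor of `repetitions` in space.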

## 308 Citations

Optimal streaming and tracking distinct elements with high probability

- Computer Science, Mathematics
- 2018

This work provides an optimal algorithm using $O(\frac{\lg \delta^{-1}}{\varepsilon^2} + \lg n)$ bits of space, matching known lower bounds for this problem, and settles completely the space complexity of the distinct elements problem with respect to all standard parameters.

Optimal Streaming and Tracking Distinct Elements with High Probability

- Computer Science, Mathematics
- ACM Trans. Algorithms
- 2020

This work provides an optimal algorithm using $O(\frac{\lg \delta^{-1}}{\varepsilon^2} + \lg n)$ bits of space, matching known lower bounds for this problem, and settles completely the space complexity of the distinct elements problem with respect to all standard parameters.

Tight bounds for data stream algorithms and communication problems

- Computer Science
- 2011

This thesis gives efficient algorithms and near-tight lower bounds for several problems in the streaming model, including finding duplicates in data streams, improving the algorithms of Gopalan and Radhakrishnan from SODA’09, and an $O(\log n \log m)$-space algorithm that works entirely in the streaming model.

An Improved Interactive Streaming Algorithm for the Distinct Elements Problem

- Computer Science
- ICALP
- 2014

The exact computation of the number of distinct elements (frequency moment $F_0$) is a fundamental problem in the study of data streaming algorithms; this work considers a model where the data stream is also processed by a powerful helper, who provides an interactive proof of the result.

An Asymptotically Optimal Algorithm for Maximum Matching in Dynamic Streams

- Computer Science
- ITCS
- 2022

We present an algorithm for the maximum matching problem in dynamic (insertion-deletions) streams with asymptotically optimal space: for any n-vertex graph, our algorithm with high probability…

Streaming Algorithms for Robust Distinct Elements

- Computer Science
- SIGMOD Conference
- 2016

This paper formalizes the problem of robust distinct elements, and develops space and time-efficient streaming algorithms for datasets in the Euclidean space, using a novel technique the authors call bucket sampling, and extends the algorithmic framework to other metric spaces by establishing a connection between bucket sampling and the theory of locality sensitive hashing.

An Optimal Lower Bound for Distinct Elements in the Message Passing Model

- Computer Science, Mathematics
- SODA
- 2014

This work considers the setting in which each player holds a subset $S_i$ of elements of a universe of size $n$, and their goal is to output a $(1 + \varepsilon)$-approximation to the total number of distinct elements in the union of the sets $S_i$ with constant probability, which can be amplified by independent repetition.

Better Streaming Algorithms for the Maximum Coverage Problem

- Computer Science, Mathematics
- Theory of Computing Systems
- 2018

The main goal of this work is to design algorithms, with approximation guarantees as close as possible to $1 - 1/e$, that use sublinear space $o(mn)$, and to study the maximum k-vertex coverage problem in the dynamic graph stream model.

Optimal streaming and tracking distinct elements with high probability

- Computer Science, Mathematics
- SODA
- 2018

This work provides an algorithm using $\mathcal{O}(\frac{\log \delta^{-1}}{\varepsilon^2} + \log n)$ bits of space, which is shown to be optimal.

An Optimal Algorithm for Large Frequency Moments Using O(n^(1-2/k)) Bits

- Computer Science, Mathematics
- APPROX-RANDOM
- 2014

This paper provides an upper bound of O(n^(1-2/k)) bits on the space required to find a k-th frequency moment, which matches, up to a constant factor, the lower bound of Woodruff et al.

## References

Algorithms for dynamic geometric problems over data streams

- Computer Science
- STOC '04
- 2004

This paper presents low-storage data structures that maintain approximate solutions to geometric problems, under insertions and deletions of points (this is called a turnstile model in [24]).

HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm

- Computer Science
- 2007

This extended abstract describes and analyses a near-optimal probabilistic algorithm, HYPERLOGLOG, dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles, and makes it possible to estimate cardinalities well beyond $10^9$ with a typical accuracy of 2% while using a memory of only 1.5 kilobytes.

Periodicity in Streams

- Computer Science
- APPROX-RANDOM
- 2010

A 1-pass randomized streaming algorithm that uses $O(\log^2 n)$ space and reports the shortest period if the given stream is periodic, and a randomized streaming algorithm with approximation factor $2 + \varepsilon$ that takes $O(1/\varepsilon^2)$ space.

A Multi-Round Communication Lower Bound for Gap Hamming and Some Consequences

- Computer Science, Mathematics
- 2009 24th Annual IEEE Conference on Computational Complexity
- 2009

It is concluded, for instance, that $\epsilon$-approximately counting the number of distinct elements in a data stream requires $\Omega(1/\epsilon^2)$ space, even with multiple (a constant number of) passes over the input stream.

Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies

- Computer Science
- VLDB
- 1996

Three strategies are proposed for estimating the storage blowup that will result from a proposed set of precomputations without actually computing them: one based on sampling, one based on mathematical approximation, and one based on probabilistic counting.

Bitmap Algorithms for Counting Active Flows on High-Speed Links

- Computer Science
- IEEE/ACM Transactions on Networking
- 2006

A family of bitmap algorithms that address the problem of counting the number of distinct header patterns (flows) seen on a high-speed link and can be used to detect DoS attacks and port scans and to solve measurement problems.

Uniform Hashing in Constant Time and Optimal Space

- Computer Science, Mathematics
- SIAM J. Comput.
- 2008

This paper presents an almost ideal solution to this problem: a hash function $h: U \to V$ that, on any set of $n$ inputs, behaves like a truly random function with high probability, can be evaluated in constant time on a RAM and can be stored in $(1+\epsilon)n\log |V| + O(n+\log \log |U|)$ bits.

Loglog Counting of Large Cardinalities (Extended Abstract)

- Computer Science
- ESA
- 2003

The LogLog algorithm makes use of $m$ "small bytes" of auxiliary memory in order to estimate in a single pass the number of distinct elements (the "cardinality") in a file, and it does so with an accuracy that is of the order of $1/\sqrt{m}$.

Estimating simple functions on the union of data streams

- Computer Science
- SPAA '01
- 2001

The distributed streams model is related to previously studied non-distributed (i.e., merged) streams models, presenting tight bounds on the gap between the distributed and merged models for deterministic algorithms, and employs a novel coordinated sampling technique to extract a sample of the union.

Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports

- Computer Science
- VLDB
- 2001

This work presents an approach, called distinct sampling, that collects a specially tailored sample over the distinct values in the input, in a single scan of the data, and shows how it can provide fast, highly accurate approximate answers for “report” queries in high-volume, session-based event recording environments, such as IP networks, customer service call centers, etc.