An optimal algorithm for the distinct elements problem

@inproceedings{Kane2010AnOA,
title={An optimal algorithm for the distinct elements problem},
author={Daniel M. Kane and Jelani Nelson and David P. Woodruff},
booktitle={PODS '10},
year={2010}
}
• Published in PODS '10, 6 June 2010
• Computer Science
We give the first optimal algorithm for estimating the number of distinct elements in a data stream, closing a long line of theoretical research on this problem begun by Flajolet and Martin in their seminal paper in FOCS 1983. [...] Key Method: This probability can be amplified by independent repetition.
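Amplification by independent repetition can be sketched as follows. This is a minimal illustration, not the algorithm of the paper: the base estimator is a simple k-minimum-values sketch chosen for brevity, and the names `hash01`, `kmv_estimate`, and `amplified_estimate`, as well as the defaults `k=256` and `repetitions=5`, are assumptions made for this example. Several independently seeded copies are run and the median of their outputs is returned, which drives the failure probability down as the number of copies grows.

```python
import hashlib
import heapq
import statistics


def hash01(item, seed):
    """Map (seed, item) to a pseudo-random float in [0, 1)."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big") / 2**64


def kmv_estimate(stream, k, seed):
    """Illustrative base estimator: track the k smallest distinct hash values."""
    heap = []     # max-heap (stored negated) of the k smallest hash values
    kept = set()  # the same values, for duplicate detection
    for item in stream:
        h = hash01(item, seed)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        return float(len(heap))  # fewer than k distinct items: count is exact
    return (k - 1) / -heap[0]    # (k - 1) / (k-th smallest hash value)


def amplified_estimate(stream, k=256, repetitions=5):
    """Median of independently seeded copies of the base estimator."""
    items = list(stream)
    return statistics.median(
        kmv_estimate(items, k, seed) for seed in range(repetitions)
    )
```

Each copy succeeds with constant probability on its own; the median is wrong only if at least half of the copies fail at once.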
308 Citations

Optimal streaming and tracking distinct elements with high probability
This work provides an optimal algorithm using $O(\frac{\lg \delta^{-1}}{\varepsilon^2} + \lg n)$ bits of space, matching known lower bounds for this problem, and completely settles the space complexity of the distinct elements problem with respect to all standard parameters.
Tight bounds for data stream algorithms and communication problems
This thesis gives efficient algorithms and near-tight lower bounds for the following problems in the streaming model: finding duplicates in data streams, improving the algorithms of Gopalan and Radhakrishnan from SODA '09, and an $O(\log n \log m)$-space algorithm that works entirely in the streaming model.
An Improved Interactive Streaming Algorithm for the Distinct Elements Problem
• Computer Science
ICALP
• 2014
The exact computation of the number of distinct elements (frequency moment $F_0$) is a fundamental problem in the study of data streaming algorithms. This work considers a model where the data stream is also processed by a powerful helper, who provides an interactive proof of the result.
An Asymptotically Optimal Algorithm for Maximum Matching in Dynamic Streams
• Computer Science
ITCS
• 2022
We present an algorithm for the maximum matching problem in dynamic (insertion-deletions) streams with asymptotically optimal space: for any n-vertex graph, our algorithm with high probability
Streaming Algorithms for Robust Distinct Elements
• Computer Science
SIGMOD Conference
• 2016
This paper formalizes the problem of robust distinct elements, and develops space and time-efficient streaming algorithms for datasets in the Euclidean space, using a novel technique the authors call bucket sampling, and extends the algorithmic framework to other metric spaces by establishing a connection between bucket sampling and the theory of locality sensitive hashing.
An Optimal Lower Bound for Distinct Elements in the Message Passing Model
• Computer Science, Mathematics
SODA
• 2014
This work considers the setting in which each player holds a subset $S_i$ of elements of a universe of size $n$, and their goal is to output a $(1 + \varepsilon)$-approximation to the total number of distinct elements in the union of the sets $S_i$ with constant probability, which can be amplified by independent repetition.
Better Streaming Algorithms for the Maximum Coverage Problem
• Computer Science, Mathematics
Theory of Computing Systems
• 2018
The main goal of this work is to design algorithms, with approximation guarantees as close as possible to $1 - 1/e$, that use sublinear space $o(mn)$, and to study the maximum k-vertex coverage problem in the dynamic graph stream model.
Optimal streaming and tracking distinct elements with high probability
This work provides an algorithm using $\mathcal{O}(\frac{\log \delta^{-1}}{\varepsilon^2} + \log n)$ bits of space, which is shown to be optimal.
An Optimal Algorithm for Large Frequency Moments Using O(n^(1-2/k)) Bits
• Computer Science, Mathematics
APPROX-RANDOM
• 2014
This paper provides an upper bound of $O(n^{1-2/k})$ bits on the space required to find a k-th frequency moment, which matches, up to a constant factor, the lower bound of Woodruff et al.
50 References
Algorithms for dynamic geometric problems over data streams
This paper presents low-storage data structures that maintain approximate solutions to geometric problems, under insertions and deletions of points (this is called a turnstile model in [24]).
HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
• Computer Science
• 2007
This extended abstract describes and analyses a near-optimal probabilistic algorithm, HYPERLOGLOG, dedicated to estimating the number of distinct elements (the cardinality) of very large data ensembles, and makes it possible to estimate cardinalities well beyond $10^9$ with a typical accuracy of 2% while using a memory of only 1.5 kilobytes.
Periodicity in Streams
• Computer Science
APPROX-RANDOM
• 2010
A 1-pass randomized streaming algorithm that uses $O(\log^2 n)$ space and reports the shortest period if the given stream is periodic, and a randomized streaming algorithm with approximation factor $2 + \varepsilon$ that takes $O(1/\varepsilon^2)$ space.
A Multi-Round Communication Lower Bound for Gap Hamming and Some Consequences
• Computer Science, Mathematics
2009 24th Annual IEEE Conference on Computational Complexity
• 2009
It is concluded, for instance, that $\epsilon$-approximately counting the number of distinct elements in a data stream requires $\Omega(1/\epsilon^2)$ space, even with multiple (a constant number of) passes over the input stream.
Storage Estimation for Multidimensional Aggregates in the Presence of Hierarchies
• Computer Science
VLDB
• 1996
Three strategies for estimating the storage blowup that will result from a proposed set of precomputations without actually computing them are proposed: one based on sampling, one based on mathematical approximation, and one based upon probabilistic counting.
Bitmap Algorithms for Counting Active Flows on High-Speed Links
• Computer Science
IEEE/ACM Transactions on Networking
• 2006
A family of bitmap algorithms that address the problem of counting the number of distinct header patterns (flows) seen on a high-speed link and can be used to detect DoS attacks and port scans and to solve measurement problems.
Uniform Hashing in Constant Time and Optimal Space
• Computer Science, Mathematics
SIAM J. Comput.
• 2008
This paper presents an almost ideal solution to this problem: a hash function $h: U \rightarrow V$ that, on any set of $n$ inputs, behaves like a truly random function with high probability, can be evaluated in constant time on a RAM and can be stored in $(1+\epsilon)n\log |V| + O(n+\log \log |U|)$ bits.
Loglog Counting of Large Cardinalities (Extended Abstract)
• Computer Science
ESA
• 2003
The LogLog algorithm makes use of m "small bytes" of auxiliary memory in order to estimate in a single pass the number of distinct elements (the "cardinality") in a file, and it does so with an accuracy that is of the order of $1/\sqrt{m}$.
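The idea summarized above can be illustrated with a short sketch. It follows the published LogLog estimator in outline (m registers, each holding the largest observed rank of the first 1-bit, combined through their arithmetic mean), but several details are assumptions made for this example: the hash function (`blake2b` truncated to 64 bits), the helper names `hash64`, `rank`, and `loglog_estimate`, and the use of the asymptotic constant 0.39701 in place of the exact m-dependent correction.

```python
import hashlib

ALPHA_INF = 0.39701  # asymptotic bias-correction constant (assumed adequate for large m)


def hash64(item):
    """Illustrative 64-bit hash; any well-mixed hash function would do."""
    digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def rank(w, width):
    """1-based position of the leftmost 1-bit in a width-bit word (width + 1 if w == 0)."""
    return width - w.bit_length() + 1


def loglog_estimate(stream, bucket_bits=10):
    """Single-pass cardinality estimate using m = 2**bucket_bits small registers."""
    m = 1 << bucket_bits
    width = 64 - bucket_bits
    registers = [0] * m
    for item in stream:
        h = hash64(item)
        j = h >> width                # leading bits select a register
        w = h & ((1 << width) - 1)    # remaining bits feed the rank
        registers[j] = max(registers[j], rank(w, width))
    return ALPHA_INF * m * 2 ** (sum(registers) / m)
```

Each register only needs to store a number on the order of log log n, which is where the "small bytes" come from; this sketch is meant for cardinalities well above m, since it applies no small-range correction.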
Estimating simple functions on the union of data streams
• Computer Science
SPAA '01
• 2001
The distributed streams model is related to previously studied non-distributed (i.e., merged) streams models, presenting tight bounds on the gap between the distributed and merged models for deterministic algorithms, and employs a novel coordinated sampling technique to extract a sample of the union.
Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports
This work presents an approach, called distinct sampling, that collects a specially tailored sample over the distinct values in the input, in a single scan of the data, and shows how it can provide fast, highly accurate approximate answers for "report" queries in high-volume, session-based event recording environments, such as IP networks, customer service call centers, etc.
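A simplified sketch of the sampling idea described above is given here. It is an assumption-laden illustration rather than the method of that work: it keeps only the distinct values themselves (no per-value counts or associated rows), and the names `hash64`, `trailing_zeros`, `distinct_sample`, and `estimate_distinct`, along with the default `capacity=1024`, are hypothetical. A value is kept only if its hash ends in at least `level` zero bits, and `level` is raised whenever the sample outgrows its capacity.

```python
import hashlib


def hash64(item):
    """Illustrative 64-bit hash of an item."""
    digest = hashlib.blake2b(str(item).encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def trailing_zeros(h):
    """Number of trailing zero bits in a 64-bit value (64 if h == 0)."""
    return (h & -h).bit_length() - 1 if h else 64


def distinct_sample(stream, capacity=1024):
    """One scan; returns (sample of distinct values, sampling level)."""
    level = 0
    sample = set()
    for item in stream:
        if trailing_zeros(hash64(item)) >= level:
            sample.add(item)  # duplicates collapse in the set
            while len(sample) > capacity:
                # Halve the sampling rate and evict values that no longer qualify.
                level += 1
                sample = {x for x in sample
                          if trailing_zeros(hash64(x)) >= level}
    return sample, level


def estimate_distinct(sample, level):
    """Each retained value stands for about 2**level distinct values."""
    return len(sample) * 2 ** level
```

Because membership depends only on a value's hash, repeated occurrences of the same value never change the sample, so the sample is uniform over distinct values rather than over rows.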