# Optimal Streaming and Tracking Distinct Elements with High Probability

@article{Basiok2018OptimalSA, title={Optimal Streaming and Tracking Distinct Elements with High Probability}, author={Jarosław Błasiok}, journal={ACM Transactions on Algorithms (TALG)}, year={2018}, volume={16}, pages={1 - 28} }

The distinct elements problem is one of the fundamental problems in streaming algorithms: given a stream of integers in the range {1, …, n}, we wish to provide a (1+ε) approximation to the number of distinct elements in the input. After a long line of research, an optimal solution for this problem with constant probability of success, using O(1/ε² + lg n) bits of space, was given by Kane, Nelson, and Woodruff in 2010. The standard approach used to achieve low failure probability δ is to take the…
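To make the problem statement concrete, here is a minimal sketch of a textbook k-minimum-values (KMV) estimator. It is not the algorithm from this paper, nor the Kane, Nelson, and Woodruff algorithm, and it does not achieve their space bound. The names `hash01` and `kmv_estimate`, the use of SHA-256 as a stand-in for a random hash function, and the default `k` are all illustrative assumptions.

```python
import hashlib
import heapq


def hash01(x: int) -> float:
    """Map an integer to a pseudo-random point in [0, 1).

    Assumption: SHA-256 stands in for an idealized random hash function.
    """
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(stream, k: int = 1024) -> float:
    """Estimate the number of distinct elements from the k smallest hash values."""
    heap = []        # max-heap via negation: holds the k smallest hashes seen so far
    members = set()  # hashes currently in the heap, used to skip repeated elements
    for x in stream:
        h = hash01(x)
        if h in members:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            members.add(h)
        elif h < -heap[0]:
            # New hash is smaller than the current k-th smallest: swap it in.
            evicted = -heapq.heappushpop(heap, -h)
            members.discard(evicted)
            members.add(h)
    if len(heap) < k:
        # Fewer than k distinct hashes were seen, so the count is exact.
        return float(len(heap))
    # With k-th smallest hash v, the number of distinct elements is about (k - 1) / v.
    return (k - 1) / -heap[0]


if __name__ == "__main__":
    stream = [i % 10_000 for i in range(50_000)]  # 10,000 distinct values, each repeated
    print(kmv_estimate(stream, k=1024))
```

The relative error of this estimator shrinks roughly like 1/√k, so choosing k on the order of 1/ε² targets a (1+ε) approximation with constant success probability, at the cost of storing k hash values rather than the optimal number of bits.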

## 30 Citations

### Tight Trade-offs for the Maximum k-Coverage Problem in the General Streaming Model

- Computer Science · PODS
- 2019

A single-pass algorithm is designed that reports an α-approximate solution in $\tilde{O}(m/\alpha^2 + k)$ space and heavily exploits data stream sketching techniques, which could lead to further connections between vector sketching methods and streaming algorithms for combinatorial optimization tasks.

### New Directions in Streaming Algorithms

- Computer Science
- 2020

This thesis describes optimal streaming algorithms for set cover and maximum coverage, two classic problems in combinatorial optimization, and shows how to augment classic streaming algorithms for the frequency estimation and low-rank approximation problems with machine learning oracles in order to improve their space-accuracy tradeoffs.

### No Repetition: Fast Streaming with Highly Concentrated Hashing

- Computer Science · ArXiv
- 2020

Here the point is that if the authors have a hash function with strong concentration bounds, then they get the same high probability bounds without any need for repetitions, and the overall algorithms just get simpler.

### Tight Bounds for Adversarially Robust Streams and Sliding Windows via Difference Estimators

- Computer Science · 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS)
- 2022

The results show there is no separation between the sliding window model and the standard data stream model in terms of the approximation factor, and the first difference estimators for a wide range of problems are developed.

### Pairwise Independent Random Walks can be Slightly Unbounded

- Mathematics, Computer Science · APPROX-RANDOM
- 2019

This paper proves a generalization of Kolmogorov's maximal inequality by showing an equivalent statement that requires only $4$-wise independent random variables with bounded second moments, which also generalizes a result of [5].

### Sliding-Window Streaming Algorithms for Graph Problems and ℓp-Sampling

- Computer Science
- 2019

It is shown that both the vertex-cover size and the maximum-matching size are 2-almost-smooth, and thus can be approximated using the smooth-histogram framework in the sliding-window model; algorithms are also developed for several problems, including $\ell_p$-sampling and maximum matching, all in the sliding-window model.

### Cardinality estimation using Gumbel distribution

- Computer Science · ESA
- 2022

A modification to both LogLog and HyperLogLog is proposed that replaces discrete geometric distribution with a continuous Gumbel distribution, which leads to a very short, simple and elementary analysis of estimation guarantees, and smoother behavior of the estimator.

### Simple and Efficient Cardinality Estimation in Data Streams

- Computer Science, Mathematics · ArXiv
- 2020

A new class of "curtain" sketches is introduced that are a bit more complex than Martingale LogLog but have substantially better MVPs, e.g., Martingale Curtain has an MVP of around $1.63$, and this is conjectured to be an information-theoretic lower bound on the problem, independent of update time.

### Nearly Optimal Distinct Elements and Heavy Hitters on Sliding Windows

- Computer Science · APPROX-RANDOM
- 2018

The composable histogram along with a careful combination of existing techniques to track either the identity or frequency of a few specific items suffices to obtain algorithms for both distinct elements and $\ell_p$-heavy hitters that are nearly optimal in both $n$ and $\epsilon$.

### Information theoretic limits of cardinality estimation: Fisher meets Shannon

- Computer Science, Mathematics · STOC
- 2021

A new measure of efficiency for cardinality estimators called the Fisher-Shannon (Fish) number H/I is defined, which captures the tension between the limiting Shannon entropy of the sketch and its normalized Fisher information, which characterizes the variance of a statistically efficient, asymptotically unbiased estimator.

## References

### An optimal algorithm for the distinct elements problem

- Computer Science · PODS '10
- 2010

The first optimal algorithm for estimating the number of distinct elements in a data stream is given, closing a long line of theoretical research on this problem, and has optimal O(1) update and reporting times.

### BPTree: An ℓ2 Heavy Hitters Algorithm Using Constant Memory

- Computer Science · PODS
- 2017

This work gives an algorithm BPTree for $\ell_2$ heavy hitters in insertion-only streams that achieves $O(\epsilon^{-2}\log\epsilon^{-1})$ words of memory and $O(\log\epsilon^{-1})$ update time, which is the optimal dependence on $n$ and $m$, and describes an algorithm for tracking $\|f\|_2$ at all times with $O(\epsilon^{-2})$ memory and update time.

### Tracking the Frequency Moments at All Times

- Computer Science · ArXiv
- 2014

It is shown that for the $F_p$ problem for any $1 < p \le 2$, the authors actually only need $O(\log \log m + \log n)$ copies to achieve the tracking guarantee in the cash register model, where $n$ is the universe size.

### Randomness-optimal oblivious sampling

- Computer Science, Mathematics · Random Struct. Algorithms
- 1997

This work presents the first efficient oblivious sampler that uses an optimal number of random bits, up to an arbitrary constant factor bigger than 1, and gives applications to constructive leader election and reducing randomness in interactive proofs.

### Streaming Space Complexity of Nearly All Functions of One Variable on Frequency Vectors

- Mathematics · PODS
- 2016

For nearly all functions of one variable, the open question of which functions on a stream can be approximated in sublinear, and especially sub-polynomial or poly-logarithmic, space is answered.

### Optimal space lower bounds for all frequency moments

- Computer Science · SODA '04
- 2004

It is proved that any one-pass streaming algorithm which (ε, δ)-approximates the kth frequency moment, for any real k ≠ 1 and any ε = Ω(1/√m), must use Ω(1/ε²) bits of space, where m is the size of the universe.

### Estimating simple functions on the union of data streams

- Computer Science · SPAA '01
- 2001

The distributed streams model is related to previously studied non-distributed (i.e., merged) streams models, presenting tight bounds on the gap between the distributed and merged models for deterministic algorithms, and employs a novel coordinated sampling technique to extract a sample of the union.

### Probabilistic counting

- Computer Science · 24th Annual Symposium on Foundations of Computer Science (SFCS 1983)
- 1983

A class of probabilistic algorithms with which one can estimate the number of distinct elements in a collection of data in a single pass, using only O(1) auxiliary storage and O(1) operations per element, is presented.

### Loglog Counting of Large Cardinalities (Extended Abstract)

- Computer Science · ESA
- 2003

The LogLog algorithm makes use of m "small bytes" of auxiliary memory in order to estimate in a single pass the number of distinct elements (the "cardinality") in a file, and it does so with an accuracy that is of the order of 1/√m.

### A Multi-Round Communication Lower Bound for Gap Hamming and Some Consequences

- Computer Science, Mathematics · 2009 24th Annual IEEE Conference on Computational Complexity
- 2009

It is concluded, for instance, that $\epsilon$-approximately counting the number of distinct elements in a data stream requires $\Omega(1/\epsilon^2)$ space, even with multiple (a constant number of) passes over the input stream.