# Optimal Streaming and Tracking Distinct Elements with High Probability

@article{Basiok2018OptimalSA,
title={Optimal Streaming and Tracking Distinct Elements with High Probability},
author={Jarosław Błasiok},
journal={ACM Transactions on Algorithms (TALG)},
year={2018},
volume={16},
pages={1 - 28}
}
• Jarosław Błasiok
• Published 7 January 2018
• Computer Science, Mathematics
• ACM Transactions on Algorithms (TALG)
The distinct elements problem is one of the fundamental problems in streaming algorithms: given a stream of integers in the range {1, …, n}, we wish to provide a (1+ε) approximation to the number of distinct elements in the input. After a long line of research, an optimal solution for this problem with constant probability of success, using O(1/ε² + lg n) bits of space, was given by Kane, Nelson, and Woodruff in 2010. The standard approach used to achieve low failure probability δ is to take the…
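To make the problem statement concrete, here is a minimal Python sketch of one well-known small-space estimator, the k-minimum-values (KMV) sketch. It is not the algorithm of this paper or of Kane, Nelson, and Woodruff; the function names `hash01` and `kmv_estimate`, the parameter `k`, and the use of SHA-256 as a stand-in for a random hash function are all assumptions made for illustration.

```python
import hashlib
import heapq


def hash01(x) -> float:
    """Map an item to a pseudo-random float in [0, 1).

    SHA-256 is an illustrative stand-in for a random hash function.
    """
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def kmv_estimate(stream, k: int = 1024) -> float:
    """Estimate the number of distinct items from the k smallest hash values."""
    heap = []        # negated hashes, so heap[0] is minus the largest kept hash
    kept = set()     # hashes currently stored, used to skip repeated items
    for x in stream:
        h = hash01(x)
        if h in kept:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            kept.add(h)
        elif h < -heap[0]:
            # Replace the largest kept hash with the new, smaller one.
            evicted = -heapq.heappushpop(heap, -h)
            kept.discard(evicted)
            kept.add(h)
    if len(heap) < k:
        # Fewer than k distinct hashes were seen, so the count is exact.
        return float(len(heap))
    # The k-th smallest of n uniform values is about k/n, which gives n ~ (k-1)/v_k.
    return (k - 1) / (-heap[0])


if __name__ == "__main__":
    stream = [i % 20000 for i in range(60000)]  # 20000 distinct values
    print(kmv_estimate(stream, k=1024))
```

The sketch stores only `k` hash values regardless of stream length, and a larger `k` gives a tighter approximation at the cost of more space. This is the same space-accuracy trade-off that the bound above makes precise.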

### Tight Trade-offs for the Maximum k-Coverage Problem in the General Streaming Model

• Computer Science
PODS
• 2019
A single-pass algorithm is designed that reports an α-approximate solution in $\tilde{O}(m/\alpha^2 + k)$ space and heavily exploits data stream sketching techniques, which could lead to further connections between vector sketching methods and streaming algorithms for combinatorial optimization tasks.

### New Directions in Streaming Algorithms

This thesis describes optimal streaming algorithms for set cover and maximum coverage, two classic problems in combinatorial optimization, and shows how to augment classic streaming algorithms for the frequency estimation and low-rank approximation problems with machine learning oracles in order to improve their space-accuracy tradeoffs.

### No Repetition: Fast Streaming with Highly Concentrated Hashing

• Computer Science
ArXiv
• 2020
Here the point is that if the authors have a hash function with strong concentration bounds, then they get the same high probability bounds without any need for repetitions, and the overall algorithms just get simpler.

### Tight Bounds for Adversarially Robust Streams and Sliding Windows via Difference Estimators

• Computer Science
2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS)
• 2022
The results show there is no separation between the sliding window model and the standard data stream model in terms of the approximation factor, and the first difference estimators for a wide range of problems are developed.

### Pairwise Independent Random Walks can be Slightly Unbounded

This paper proves a generalization of Kolmogorov's maximal inequality by showing an equivalent statement that requires only $4$-wise independent random variables with bounded second moments, which also generalizes a result of [5].

### Sliding-Window Streaming Algorithms for Graph Problems and p-Sampling

• Computer Science
• 2019
It is shown that both the vertex-cover size and the maximum-matching size are 2-almost-smooth, and thus can be approximated using the smooth-histogram framework in the sliding-window model; algorithms are also developed for several problems, including p-sampling and maximum-matching, all in the sliding-window model.

### Cardinality estimation using Gumbel distribution

• Computer Science
ESA
• 2022
A modification to both LogLog and HyperLogLog is proposed that replaces discrete geometric distribution with a continuous Gumbel distribution, which leads to a very short, simple and elementary analysis of estimation guarantees, and smoother behavior of the estimator.

### Simple and Efficient Cardinality Estimation in Data Streams

• Computer Science, Mathematics
ArXiv
• 2020
A new class of "curtain" sketches that are a bit more complex than Martingale LogLog but with substantially better MVPs, e.g., Martingale Curtain has MVP of around $1.63$, and conjecture this to be an information-theoretic lower bound on the problem, independent of update time.

### Nearly Optimal Distinct Elements and Heavy Hitters on Sliding Windows

• Computer Science
APPROX-RANDOM
• 2018
The composable histogram along with a careful combination of existing techniques to track either the identity or frequency of a few specific items suffices to obtain algorithms for both distinct elements and $\ell_p$-heavy hitters that are nearly optimal in both $n$ and $\epsilon$.

### Information theoretic limits of cardinality estimation: Fisher meets Shannon

• Computer Science, Mathematics
STOC
• 2021
A new measure of efficiency for cardinality estimators called the Fisher-Shannon (Fish) number H/I is defined, which captures the tension between the limiting Shannon entropy of the sketch and its normalized Fisher information, which characterizes the variance of a statistically efficient, asymptotically unbiased estimator.

## References

### An optimal algorithm for the distinct elements problem

• Computer Science
PODS '10
• 2010
The first optimal algorithm for estimating the number of distinct elements in a data stream is given, closing a long line of theoretical research on this problem; the algorithm also has optimal O(1) update and reporting times.

### BPTree: An ℓ2 Heavy Hitters Algorithm Using Constant Memory

• Computer Science
PODS
• 2017
This work gives an algorithm BPTree for ℓ2 heavy hitters in insertion-only streams that achieves O(ε⁻² log ε⁻¹) words of memory and O(log ε⁻¹) update time, which is the optimal dependence on n and m, and describes an algorithm for tracking ‖f‖₂ at all times with O(ε⁻²) memory and update time.

### Tracking the Frequency Moments at All Times

• Computer Science
ArXiv
• 2014
It is shown that for the $F_p$ problem for any $1 < p \le 2$, the authors actually only need $O(\log \log m + \log n)$ copies to achieve the tracking guarantee in the cash register model, where $n$ is the universe size.

### Randomness-optimal oblivious sampling

• D. Zuckerman
• Computer Science, Mathematics
Random Struct. Algorithms
• 1997
This work presents the first efficient oblivious sampler that uses an optimal number of random bits, up to an arbitrary constant factor bigger than 1, and gives applications to constructive leader election and reducing randomness in interactive proofs.

### Streaming Space Complexity of Nearly All Functions of One Variable on Frequency Vectors

• Mathematics
PODS
• 2016
For nearly all functions of one variable, the open question of which functions on a stream can be approximated in sublinear, and especially sub-polynomial or poly-logarithmic, space is answered.

### Optimal space lower bounds for all frequency moments

It is proved that any one-pass streaming algorithm which (ε, Δ)-approximates the kth frequency moment, for any real k ≠ 1 and any ε = Ω(1/√m), must use Ω(1/ε²) bits of space, where m is the size of the universe.

### Estimating simple functions on the union of data streams

• Computer Science
SPAA '01
• 2001
The distributed streams model is related to previously studied non-distributed (i.e., merged) streams models, presenting tight bounds on the gap between the distributed and merged models for deterministic algorithms, and employs a novel coordinated sampling technique to extract a sample of the union.

### Probabilistic counting

• Computer Science
24th Annual Symposium on Foundations of Computer Science (sfcs 1983)
• 1983
A class of probabilistic algorithms with which one can estimate the number of distinct elements in a collection of data in a single pass, using only O(1) auxiliary storage and O(1) operations per element, is presented.

### Loglog Counting of Large Cardinalities (Extended Abstract)

• Computer Science
ESA
• 2003
The LogLog algorithm makes use of m "small bytes" of auxiliary memory in order to estimate in a single pass the number of distinct elements (the "cardinality") in a file, and it does so with an accuracy that is of the order of 1/√m.
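The following is a minimal Python sketch of the basic LogLog estimator, under stated assumptions: SHA-256 truncated to 64 bits stands in for the hash function, the asymptotic constant 0.39701 is used in place of the exact finite-m correction, and the names `hash64`, `loglog_estimate`, and the parameter `b` (with m = 2^b registers) are chosen here for illustration. It omits the refinements of the published algorithm.

```python
import hashlib

ALPHA_INF = 0.39701  # asymptotic bias-correction constant of LogLog


def hash64(x) -> int:
    """Illustrative 64-bit hash built from SHA-256."""
    digest = hashlib.sha256(str(x).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def loglog_estimate(stream, b: int = 8) -> float:
    """Estimate the cardinality of `stream` with m = 2**b small registers."""
    m = 1 << b
    registers = [0] * m
    rest_bits = 64 - b
    mask = (1 << rest_bits) - 1
    for x in stream:
        h = hash64(x)
        j = h >> rest_bits   # the first b bits select a register
        w = h & mask         # the remaining bits determine the rank
        # Rank: 1-indexed position of the leftmost 1-bit among the remaining bits.
        rank = rest_bits - w.bit_length() + 1
        if rank > registers[j]:
            registers[j] = rank
    mean_rank = sum(registers) / m
    # Combine the registers through the arithmetic mean of the ranks.
    return ALPHA_INF * m * 2 ** mean_rank


if __name__ == "__main__":
    data = list(range(50000)) * 2   # 50000 distinct values, each seen twice
    print(loglog_estimate(data, b=8))
```

Each register only needs to hold a rank of at most about 64 − b + 1, which is why a few bits per register suffice. Repeated items never change a register, so the estimate depends only on the set of distinct values. The sketch is meant for cardinalities well above m, since very small inputs leave many registers empty and bias the estimate.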

### A Multi-Round Communication Lower Bound for Gap Hamming and Some Consequences

• Computer Science, Mathematics
2009 24th Annual IEEE Conference on Computational Complexity
• 2009
It is concluded, for instance, that $\epsilon$-approximately counting the number of distinct elements in a data stream requires $\Omega(1/\epsilon^2)$ space, even with multiple (a constant number of) passes over the input stream.