Optimal Streaming and Tracking Distinct Elements with High Probability

@article{Basiok2018OptimalSA,
  title={Optimal Streaming and Tracking Distinct Elements with High Probability},
  author={Jarosław Błasiok},
  journal={ACM Transactions on Algorithms (TALG)},
  year={2018},
  volume={16},
  pages={1--28}
}
  • Published 7 January 2018
  • Computer Science, Mathematics
The distinct elements problem is one of the fundamental problems in streaming algorithms: given a stream of integers in the range {1, …, n}, we wish to provide a (1+ε)-approximation to the number of distinct elements in the input. After a long line of research, an optimal solution for this problem with constant probability of success, using O(1/ε² + lg n) bits of space, was given by Kane, Nelson, and Woodruff in 2010. The standard approach used to achieve low failure probability δ is to take the… 
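The median-of-repetitions approach referenced in the abstract can be illustrated with a minimal k-minimum-values (KMV) estimator. This is an illustrative toy, not the paper's construction: the SHA-256 stand-in for a random hash function, the parameter choices, and the helper names are all assumptions.

```python
import hashlib
import statistics

def h(x, seed):
    """Deterministic hash of x into [0, 1); a stand-in for a random hash function."""
    d = hashlib.sha256(f"{seed}:{x}".encode()).digest()
    return int.from_bytes(d[:8], "big") / 2**64

def kmv_estimate(stream, k, seed=0):
    """k-minimum-values estimate of the number of distinct elements.
    (A real implementation keeps only the k smallest hashes in a heap.)"""
    mins = sorted({h(x, seed) for x in stream})[:k]
    if len(mins) < k:
        return len(mins)            # fewer than k distinct items: count is exact
    return int((k - 1) / mins[-1])  # k-th smallest hash is ~ k / (#distinct)

def kmv_median(stream, k, reps):
    """Median of independent repetitions drives the failure probability
    down to 2^(-Omega(reps)), at the cost of a reps factor in space."""
    stream = list(stream)
    return statistics.median(kmv_estimate(stream, k, seed=r) for r in range(reps))
```

Blasiok's result shows this blow-up in space from repetitions can be avoided; the sketch above only illustrates the standard approach being improved upon.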

Tight Trade-offs for the Maximum k-Coverage Problem in the General Streaming Model

TLDR
A single-pass algorithm is designed that reports an α-approximate solution in $\tilde{O}(m/\alpha^2 + k)$ space and heavily exploits data-stream sketching techniques, which could lead to further connections between vector sketching methods and streaming algorithms for combinatorial optimization tasks.

New Directions in Streaming Algorithms

TLDR
This thesis describes optimal streaming algorithms for set cover and maximum coverage, two classic problems in combinatorial optimization and shows how to augment classic streaming algorithms of the frequency estimation and low-rank approximation problems with machine learning oracles in order to improve their space-accuracy tradeoffs.

No Repetition: Fast Streaming with Highly Concentrated Hashing

TLDR
The point is that, given a hash function with strong concentration bounds, the same high-probability bounds are obtained without any need for repetitions, and the overall algorithms just get simpler.

Tight Bounds for Adversarially Robust Streams and Sliding Windows via Difference Estimators

TLDR
The results show there is no separation between the sliding window model and the standard data stream model in terms of the approximation factor, and the first difference estimators for a wide range of problems are developed.

Pairwise Independent Random Walks can be Slightly Unbounded

TLDR
This paper proves a generalization of Kolmogorov's maximal inequality by showing an equivalent statement that requires only $4$-wise independent random variables with bounded second moments, which also generalizes a result of [5].

Sliding-Window Streaming Algorithms for Graph Problems and ℓp-Sampling

TLDR
It is shown that both the vertex-cover size and the maximum-matching size are 2-almost-smooth, and thus can be approximated using the smooth-histogram framework in the sliding-window model; algorithms are also developed for several problems, including $\ell_p$-sampling and maximum matching, all in the sliding-window model.

Cardinality estimation using Gumbel distribution

TLDR
A modification to both LogLog and HyperLogLog is proposed that replaces the discrete geometric distribution with a continuous Gumbel distribution, which leads to a very short, simple, and elementary analysis of the estimation guarantees and to smoother behavior of the estimator.
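A rough sketch of the continuous-register idea: with stochastic averaging into m buckets, each register keeps the maximum of Gumbel variates derived from uniform hash values, and max-stability makes the maximum of n i.i.d. Gumbel(0, 1) variables exactly Gumbel(log n, 1). The hashing scheme and the averaging estimator below are illustrative assumptions, not the paper's algorithm.

```python
import hashlib
import math

GAMMA = 0.5772156649015329  # Euler-Mascheroni constant: mean of Gumbel(0, 1)

def u01(x, salt=""):
    """Deterministic hash of x into the open interval (0, 1)."""
    d = hashlib.sha256(f"{salt}:{x}".encode()).digest()
    return (int.from_bytes(d[:8], "big") + 1) / (2**64 + 1)

def gumbel_cardinality(stream, m=256):
    """Each register holds the max Gumbel value of its bucket; the max over
    roughly n/m items is Gumbel(log(n/m), 1) with mean log(n/m) + GAMMA."""
    regs = [-math.inf] * m
    for x in stream:
        j = int.from_bytes(hashlib.sha256(f"b:{x}".encode()).digest()[:4], "big") % m
        g = -math.log(-math.log(u01(x)))  # Gumbel(0, 1) variate from a uniform hash
        regs[j] = max(regs[j], g)
    # mean register value is ~ log(n/m) + GAMMA when every bucket is occupied
    return m * math.exp(sum(regs) / m - GAMMA)
```

Note that continuous registers trade the small integer "bytes" of LogLog for floats; the point of the sketch is only the smoother, elementary analysis the TLDR mentions.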

Simple and Efficient Cardinality Estimation in Data Streams

TLDR
A new class of "curtain" sketches is introduced that are a bit more complex than Martingale LogLog but have substantially better MVPs; e.g., Martingale Curtain has an MVP of around $1.63$, which is conjectured to be an information-theoretic lower bound on the problem, independent of update time.

Nearly Optimal Distinct Elements and Heavy Hitters on Sliding Windows

TLDR
The composable histogram along with a careful combination of existing techniques to track either the identity or frequency of a few specific items suffices to obtain algorithms for both distinct elements and $\ell_p$-heavy hitters that are nearly optimal in both $n$ and $\epsilon$.

Information theoretic limits of cardinality estimation: Fisher meets Shannon

TLDR
A new measure of efficiency for cardinality estimators, called the Fisher-Shannon (Fish) number H/I, is defined; it captures the tension between the limiting Shannon entropy H of the sketch and its normalized Fisher information I, which characterizes the variance of a statistically efficient, asymptotically unbiased estimator.

References

Showing 1-10 of 30 references

An optimal algorithm for the distinct elements problem

TLDR
The first optimal algorithm for estimating the number of distinct elements in a data stream is given, closing a long line of theoretical research on this problem, and has optimal O(1) update and reporting times.

BPTree: An ℓ2 Heavy Hitters Algorithm Using Constant Memory

TLDR
This work gives an algorithm BPTree for $\ell_2$ heavy hitters in insertion-only streams that achieves $O(\varepsilon^{-2} \log \varepsilon^{-1})$ words of memory and $O(\log \varepsilon^{-1})$ update time, which is the optimal dependence on $n$ and $m$, and describes an algorithm for tracking $\|f\|_2$ at all times with $O(\varepsilon^{-2})$ memory and update time.

Tracking the Frequency Moments at All Times

TLDR
It is shown that for the $F_p$ problem for any $1 < p \le 2$, the authors actually only need $O(\log \log m + \log n)$ copies to achieve the tracking guarantee in the cash register model, where $n$ is the universe size.

Randomness-optimal oblivious sampling

  • D. Zuckerman
  • Computer Science, Mathematics
    Random Struct. Algorithms
  • 1997
TLDR
This work presents the first efficient oblivious sampler that uses an optimal number of random bits, up to an arbitrary constant factor bigger than 1, and gives applications to constructive leader election and reducing randomness in interactive proofs.

Streaming Space Complexity of Nearly All Functions of One Variable on Frequency Vectors

TLDR
For nearly all functions of one variable, this work answers the open question of which functions of a frequency vector can be approximated on a stream in sublinear, and especially sub-polynomial or poly-logarithmic, space.

Optimal space lower bounds for all frequency moments

TLDR
It is proved that any one-pass streaming algorithm which (ε, δ)-approximates the kth frequency moment, for any real k ≠ 1 and any ε = Ω(1/√m), must use Ω(1/ε²) bits of space, where m is the size of the universe.

Estimating simple functions on the union of data streams

TLDR
This work relates the distributed streams model to previously studied non-distributed (i.e., merged) stream models, presents tight bounds on the gap between the distributed and merged models for deterministic algorithms, and employs a novel coordinated sampling technique to extract a sample of the union.

Probabilistic counting

  • P. Flajolet, G. Martin
  • Computer Science
    24th Annual Symposium on Foundations of Computer Science (sfcs 1983)
  • 1983
TLDR
A class of probabilistic algorithms with which one can estimate the number of distinct elements in a collection of data in a single pass, using only O(1) auxiliary storage and O(1) operations per element, is presented.
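A minimal sketch of the Flajolet-Martin bitmap idea described above. Averaging the first-zero index over independent salted sketches (rather than the paper's exact scheme) and the SHA-256 stand-in hash are assumptions of this toy.

```python
import hashlib

PHI = 0.77351  # Flajolet-Martin bias-correction constant

def fm_bitmap(stream, salt):
    """One Flajolet-Martin sketch: OR together 2^rho(h(x)) over the stream,
    where rho(v) is the position of the lowest set bit of the hash value."""
    bitmap = 0
    for x in stream:
        h = int.from_bytes(hashlib.sha256(f"{salt}:{x}".encode()).digest()[:8], "big")
        bitmap |= h & -h  # isolates the lowest set bit: 2^rho(h)
    return bitmap

def fm_estimate(stream, sketches=64):
    """Average the index R of the first zero bit over independent sketches;
    E[R] is about log2(PHI * n), so n is about 2^R / PHI."""
    stream = list(stream)
    total_R = 0
    for s in range(sketches):
        bitmap = fm_bitmap(stream, s)
        R = 0
        while bitmap >> R & 1:
            R += 1
        total_R += R
    return 2 ** (total_R / sketches) / PHI
```

A single sketch uses O(1) storage but has high variance; the averaging wrapper here is only to make the estimate stable enough to demonstrate.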

Loglog Counting of Large Cardinalities (Extended Abstract)

TLDR
The LogLog algorithm makes use of m "small bytes" of auxiliary memory in order to estimate in a single pass the number of distinct elements (the "cardinality") in a file, and it does so with an accuracy that is of the order of 1/√m.
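A toy single-pass LogLog along these lines: the first k hash bits select one of m = 2^k registers ("small bytes"), each register keeps the largest leading-1-bit position seen, and the geometric mean of 2^register values yields the estimate. The 64-bit SHA-256 stand-in hash and the fixed asymptotic constant are simplifying assumptions of this sketch.

```python
import hashlib

ALPHA = 0.39701  # asymptotic LogLog bias-correction constant (Durand-Flajolet)

def loglog_estimate(stream, k=8):
    """LogLog with m = 2^k registers; standard error is about 1.30/sqrt(m)."""
    m = 1 << k
    regs = [0] * m
    for x in stream:
        h = int.from_bytes(hashlib.sha256(str(x).encode()).digest()[:8], "big")
        j = h >> (64 - k)                     # first k bits pick the register
        w = h & ((1 << (64 - k)) - 1)         # remaining 64-k bits
        rank = (64 - k) - w.bit_length() + 1  # position of leading 1-bit, 1-indexed
        regs[j] = max(regs[j], rank)
    return ALPHA * m * 2 ** (sum(regs) / m)
```

Each register only needs to store a value up to about log2(n), hence log log n bits per register, which is where the algorithm's name comes from.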

A Multi-Round Communication Lower Bound for Gap Hamming and Some Consequences

TLDR
It is concluded, for instance, that $\epsilon$-approximately counting the number of distinct elements in a data stream requires $\Omega(1/\epsilon^2)$ space, even with multiple (a constant number of) passes over the input stream.