Corpus ID: 231855412

Learning-augmented count-min sketches via Bayesian nonparametrics

@article{Dolera2021LearningaugmentedCS,
  title={Learning-augmented count-min sketches via Bayesian nonparametrics},
  author={Emanuele Dolera and Stefano Favaro and Stefano Peluchetti},
  journal={ArXiv},
  year={2021},
  volume={abs/2102.04462}
}
The count-min sketch (CMS) is a time- and memory-efficient randomized data structure that provides estimates of tokens’ frequencies in a data stream of tokens, i.e. point queries, based on random hashed data. A learning-augmented version of the CMS, referred to as CMS-DP, has been proposed by Cai, Mitzenmacher and Adams (NeurIPS 2018), and it relies on Bayesian nonparametric (BNP) modeling of the data stream of tokens via a Dirichlet process (DP) prior, with estimates of a point query being…
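
As a point of reference for the data structure discussed throughout, here is a minimal, self-contained Python sketch of a classical CMS (the hash function, width, and depth below are illustrative choices, not parameters from the paper): an update increments one counter per row, and a point query returns the minimum over rows, which upper-bounds the true frequency when all updates are non-negative.

import hashlib

class CountMinSketch:
    """Minimal count-min sketch: depth rows of width counters, one hash per row."""

    def __init__(self, width=2000, depth=5):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, token, row):
        # Row-seeded hash; any pairwise-independent hash family would do here.
        digest = hashlib.blake2b(f"{row}:{token}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def update(self, token, count=1):
        for row in range(self.depth):
            self.table[row][self._index(token, row)] += count

    def query(self, token):
        # Collisions can only add mass, so every row overestimates; take the minimum.
        return min(self.table[row][self._index(token, row)] for row in range(self.depth))

With this illustrative class, cms = CountMinSketch(); cms.update("foo"); cms.query("foo") returns an overestimate of the number of times "foo" was inserted; the CMS-DP described in the abstract replaces the deterministic minimum rule with posterior-based estimates under a DP prior on the stream.
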
2 Citations

Bayesian nonparametric estimation of coverage probabilities and distinct counts from sketched data

The proposed Bayesian estimators are shown to be easily applicable to large-scale analyses in combination with a Dirichlet process prior, although they involve some open computational challenges under the more general Pitman-Yor process prior.

Asymptotic Efficiency of Point Estimators in Bayesian Predictive Inference

The point estimation problems that emerge in Bayesian predictive inference are concerned with random quantities which depend on both observable and non-observable variables. Intuition suggests…

References

SHOWING 1-10 OF 71 REFERENCES

A Bayesian Nonparametric View on Count-Min Sketch

A Bayesian view on the count-min sketch is presented, using the same data structure but providing a posterior distribution over the frequencies that characterizes the uncertainty arising from the hash-based approximation; it is shown that posterior marginals of the unknown true counts can be computed straightforwardly.

A Bayesian nonparametric approach to count-min sketch under power-law data streams

A recent Bayesian nonparametric (BNP) view on the count-min sketch is used to develop a novel learning-augmented CMS under power-law data streams, which achieves remarkable performance in the estimation of low-frequency tokens.
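
The closed-form posteriors are derived in the cited papers and are not reproduced here; the snippet below is only a hedged Monte Carlo illustration of the underlying idea under a DP assumption on the stream (all numerical settings are illustrative): tokens are drawn from a Chinese restaurant process, distinct tokens are hashed at random into J buckets, and the empirical distribution of a token's true count given the total of its bucket gives a simulation-based sense of the uncertainty that the BNP posteriors quantify analytically.

import random
from collections import Counter, defaultdict

def crp_stream(n, theta, rng):
    """Sample n tokens from a Chinese restaurant process with concentration theta."""
    tokens, counts = [], []
    for i in range(n):
        if rng.random() < theta / (theta + i):
            counts.append(0)                          # a brand-new token appears
            tokens.append(len(counts) - 1)
        else:
            tokens.append(tokens[rng.randrange(i)])   # reuse a token, proportionally to its count
        counts[tokens[-1]] += 1
    return tokens

def bucket_vs_true(n=2000, theta=10.0, J=64, reps=100, seed=0):
    """Empirical joint of (bucket total, true count) over random hashings of CRP streams."""
    rng = random.Random(seed)
    joint = defaultdict(Counter)
    for _ in range(reps):
        freqs = Counter(crp_stream(n, theta, rng))
        bucket_of = {x: rng.randrange(J) for x in freqs}   # one random hash bucket per distinct token
        totals = defaultdict(int)
        for x, f in freqs.items():
            totals[bucket_of[x]] += f
        for x, f in freqs.items():
            joint[totals[bucket_of[x]]][f] += 1
    return joint   # joint[c][f]: how often a token with true count f sat in a bucket with total c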

(Learned) Frequency Estimation Algorithms under Zipfian Distribution

The first error bounds for both the standard and the augmented versions of Count-Sketch are provided, which show that to minimise the expected error, the number of hash functions should be a constant strictly greater than 1.
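
For contrast with the CMS, a minimal Count-Sketch keeps signed counters and returns the median across rows, so collisions cancel in expectation instead of only inflating the estimate; as before, the hashing, width, and depth below are illustrative choices, not values from the paper.

import hashlib
from statistics import median

class CountSketch:
    """Minimal Count-Sketch: signed counters, median-of-rows estimate."""

    def __init__(self, width=2000, depth=5):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index_and_sign(self, token, row):
        digest = hashlib.blake2b(f"{row}:{token}".encode(), digest_size=9).digest()
        index = int.from_bytes(digest[:8], "big") % self.width
        sign = 1 if digest[8] % 2 == 0 else -1
        return index, sign

    def update(self, token, count=1):
        for row in range(self.depth):
            index, sign = self._index_and_sign(token, row)
            self.table[row][index] += sign * count

    def query(self, token):
        # Each row gives an unbiased estimate; the median across rows controls the variance.
        estimates = []
        for row in range(self.depth):
            index, sign = self._index_and_sign(token, row)
            estimates.append(sign * self.table[row][index])
        return median(estimates)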

Count-Min: Optimal Estimation and Tight Error Bounds using Empirical Error Distributions

New count estimators are derived, including a provably optimal estimator, which match or improve upon previous estimators in all scenarios; practical, tight error bounds at query time are provided for all estimators, together with methods to tune sketch parameters using these bounds.

Posterior Analysis for Normalized Random Measures with Independent Increments

A comprehensive Bayesian nonparametric analysis is given of random probabilities obtained by normalizing random measures with independent increments (NRMI), which makes it possible to derive a generalized Blackwell–MacQueen sampling scheme; this scheme is then adapted to also cover mixture models driven by general NRMIs.
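
For orientation, the classical Blackwell–MacQueen urn scheme that this work generalizes is the predictive rule of the Dirichlet process: for a DP with concentration parameter \theta and base measure G_0,

\mathbb{P}\bigl(X_{n+1} \in \cdot \mid X_1, \ldots, X_n\bigr)
  = \frac{\theta}{\theta + n}\, G_0(\cdot)
  + \frac{1}{\theta + n} \sum_{i=1}^{n} \delta_{X_i}(\cdot),

where \delta_{X_i} denotes the point mass at X_i; under a general NRMI the two weights and the base-measure term take the more involved forms derived in the paper.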

Are Gibbs-Type Priors the Most Natural Generalization of the Dirichlet Process?

The goal of this paper is to provide a systematic and unified treatment of Gibbs-type priors and highlight their implications for Bayesian nonparametric inference.

Count-Min-Log sketch: Approximately counting with approximate counters

This paper proposes the Count-Min-Log sketch, which uses logarithm-based approximate counters instead of linear counters to improve the average relative error of the CMS at a constant memory footprint.
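
The logarithm-based counters mentioned above are in the spirit of Morris-style approximate counting; the following hedged sketch (a generic Morris counter with base b, not the exact Count-Min-Log scheme) stores a small exponent c instead of the count, advances it only with probability b^{-c}, and reports (b^c - 1)/(b - 1), which is unbiased for the number of increments.

import random

class ApproxCounter:
    """Morris-style approximate counter: stores a log-scale exponent instead of the count."""

    def __init__(self, base=2.0, rng=None):
        self.base = base
        self.c = 0
        self.rng = rng or random.Random()

    def increment(self):
        # Advance the exponent only with probability base**(-c).
        if self.rng.random() < self.base ** (-self.c):
            self.c += 1

    def estimate(self):
        # Unbiased estimate of the number of increment() calls.
        return (self.base ** self.c - 1) / (self.base - 1)

Plugging such counters into each cell of a CMS trades a small relative error for a much smaller memory footprint, which is the trade-off the Count-Min-Log paper analyzes.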

Hierarchical Mixture Modeling With Normalized Inverse-Gaussian Priors

In recent years the Dirichlet process prior has experienced great success in the context of Bayesian mixture modeling. The idea of overcoming the discreteness of its realizations by exploiting it in…

On parameter estimation with the Wasserstein distance

These results cover the misspecified setting, in which the data-generating process is not assumed to be part of the family of distributions described by the model, and some difficulties arising in the numerical approximation of these estimators are discussed.
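
As a toy illustration of minimum Wasserstein estimation (not the estimators or asymptotic theory of the cited paper), the snippet below fits a Gaussian location parameter by a grid search that minimizes the one-dimensional empirical Wasserstein distance between the data and synthetic draws from the model; numpy and scipy are assumed to be available, and all numbers are illustrative.

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=1000)   # observed sample; true location is 3.0

def fit_location(sample, grid, model_draws=2000):
    """Grid-search minimum Wasserstein estimate of a Gaussian location parameter."""
    best_mu, best_dist = None, np.inf
    for mu in grid:
        synthetic = rng.normal(loc=mu, scale=1.0, size=model_draws)
        dist = wasserstein_distance(sample, synthetic)
        if dist < best_dist:
            best_mu, best_dist = mu, dist
    return best_mu

print(fit_location(data, grid=np.linspace(0.0, 6.0, 121)))   # lands near 3.0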

An improved data stream summary: the count-min sketch and its applications

The Count-Min Sketch allows fundamental queries in data stream summarization, such as point, range, and inner product queries, to be approximately answered very quickly; it can also be applied to solve several important problems in data streams, such as finding quantiles and frequent items.
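
Beyond point queries, the inner product of two streams can be estimated from their sketches alone, provided both sketches share the same width, depth, and hash functions; the hedged helper below (written against the illustrative CountMinSketch table layout sketched after the abstract above) takes row-wise dot products and returns the minimum, which upper-bounds the true inner product for non-negative counts.

def inner_product_estimate(table_a, table_b):
    """CMS inner-product estimate: row-wise dot products, then the minimum over rows."""
    estimates = []
    for row_a, row_b in zip(table_a, table_b):
        estimates.append(sum(a * b for a, b in zip(row_a, row_b)))
    return min(estimates)

For two sketches cms1 and cms2 built with identical parameters, inner_product_estimate(cms1.table, cms2.table) approximates the sum over tokens of their frequencies in the first stream times their frequencies in the second.
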
...