• Corpus ID: 215736964

# Composable Sketches for Functions of Frequencies: Beyond the Worst Case

@inproceedings{Cohen2020ComposableSF,
title={Composable Sketches for Functions of Frequencies: Beyond the Worst Case},
author={Edith Cohen and Ofir Geri and R. Pagh},
booktitle={ICML},
year={2020}
}
• Published in ICML 9 April 2020
• Computer Science
Recently there has been increased interest in using machine learning techniques to improve classical algorithms. In this paper we study when it is possible to construct compact, composable sketches for weighted sampling and statistics estimation according to functions of data frequencies. Such structures are now central components of large-scale data analytics and machine learning pipelines. However, many common functions, such as thresholds and $p$th frequency moments with $p>2$, are known to…
15 Citations

Persistent Summaries
• Computer Science
ACM Transactions on Database Systems
• 2022
This paper aims at designing persistent summaries, thereby giving streaming algorithms the ability to answer queries about the stream at any prior time.
Faster Fundamental Graph Algorithms via Learned Predictions
• Computer Science
ArXiv
• 2022
A set of general learnability theorems is given, showing that the predictions required by the algorithms can be efficiently learned in a PAC fashion, leading to new algorithms for degree-constrained subgraph and minimum-cost 0-1 flow, based on reductions to bipartite matching and the shortest path problem.
Triangle and Four Cycle Counting with Predictions in Graph Streams
• Computer Science
ArXiv
• 2022
The power of a “heavy edge” oracle in multiple graph edge streaming models is explored and a one-pass triangle counting algorithm improving upon the previous space upper bounds without such an oracle is presented.
Few-Shot Data-Driven Algorithms for Low Rank Approximation
• Computer Science
NeurIPS
• 2021
These algorithms are interpretable: while previous algorithms choose the sketching matrix either at random or by black-box learning, this work shows that it can be set to clearly interpretable values extracted from the dataset.
Faster Matchings via Learned Duals
• Computer Science
NeurIPS
• 2021
A rigorous, practical, and empirically effective method to compute bipartite matchings, and a first step in this direction by combining the idea of machine-learned predictions with the “warm-starting” primal-dual algorithms.
Learning Online Algorithms with Distributional Advice
• Computer Science, Mathematics
ICML
• 2021
For the broad class of log-concave distributions, it is shown that $\mathrm{poly}(1/\epsilon)$ samples suffice to obtain a $(1+\epsilon)$-competitive ratio, and the sample upper bound is close to best possible, even for very simple classes of distributions.
Non-Clairvoyant Scheduling with Predictions
SPAA
• 2021
This work revisits the single-machine non-clairvoyant scheduling problem and proposes a new measure to gauge prediction quality and design scheduling algorithms with strong guarantees under this measure based on natural desiderata.
Differentially Private Weighted Sampling
• Computer Science
AISTATS
• 2021
PWS maximizes the reporting probabilities of keys and also improves over the state of the art for the well-studied special case of private histograms, when no sampling is performed.
p-SAMPLING WITHOUT REPLACEMENT
• Computer Science
• 2020
This work designs novel composable sketches for WOR $\ell_p$ sampling, that is, weighted sampling of keys according to a power $p \in [0, 2]$ of their frequency (or, for signed data, the sum of updates); the sketch size grows only linearly with the sample size.

## References

Mergeable summaries
• Computer Science
TODS
• 2013
This article demonstrates that heavy hitters and quantiles summaries are indeed mergeable or can be made mergeable after appropriate modifications, provides the best known randomized streaming bound for $\epsilon$-approximate quantiles that depends only on $\epsilon$, of size $O((1/\epsilon) \log^{3/2}(1/\epsilon))$, and demonstrates that the MG and the SpaceSaving summaries for heavy hitters are isomorphic.
Learning-Based Frequency Estimation Algorithms
• Computer Science
ICLR
• 2019
This work proposes a new class of algorithms that automatically learn relevant patterns in the input data and use them to improve its frequency estimates, and proves that these learning-based algorithms have lower estimation errors than their non-learning counterparts.
Perfect Lp Sampling in a Data Stream
• Computer Science, Mathematics
2018 IEEE 59th Annual Symposium on Foundations of Computer Science (FOCS)
• 2018
This paper shows that $\nu$ need not factor into the space of an $L_p$ sampler, which completely closes the complexity of the problem for this range of $p$, and shows that a $(1 \pm \epsilon)$ relative error estimate of the frequency $f_i$ of the sampled index $i$ can be obtained using an additional $O(\epsilon^{-p} \log n)$ bits of space for $p < 2$, and $O(\epsilon^{-2} \log^2 n)$ bits for $p = 2$.
The space complexity of approximating the frequency moments
• Mathematics
STOC '96
• 1996
It turns out that the numbers $F_0$, $F_1$ and $F_2$ can be approximated in logarithmic space, whereas the approximation of $F_k$ for $k \geq 6$ requires $n^{\Omega(1)}$ space.
Asymptotic theory for order sampling
A Tight Lower Bound for High Frequency Moment Estimation with Small Error
• Computer Science, Mathematics
APPROX-RANDOM
• 2013
This lower bound matches the space complexity of an upper bound of Ganguly for any $p > 2$ and $\epsilon \geq 1/n^{1/p}$ and is optimal for $\epsilon < 1/\log^{O(1)} n$.
Better Algorithms for Counting Triangles in Data Streams
• Computer Science, Mathematics
PODS
• 2016
To do this, the first algorithm for $\ell_p$ sampling such that multiple independent samples can be generated with $O(\mathrm{polylog}\, n)$ update time is developed; this primitive is widely applicable and this result may be of independent interest.
Stable distributions
We give many explicit formulas for stable distributions, mainly based on Feller [3] and Zolotarev [14] and using several parametrizations; we give also some explicit calculations for convergence to
Stable distributions, pseudorandom generators, embeddings, and data stream computation
The aforementioned sketching approach directly translates into an approximate algorithm that solves the main open problem of Feigenbaum et al.
Finding Frequent Items in Data Streams
• Computer Science
ICALP
• 2002
This work presents a 1-pass algorithm for estimating the most frequent items in a data stream using limited storage space, which achieves better space bounds than the previously known best algorithms for this problem for several natural distributions on the item frequencies.