Corpus ID: 215736964

Composable Sketches for Functions of Frequencies: Beyond the Worst Case

@inproceedings{Cohen2020ComposableSF,
  title={Composable Sketches for Functions of Frequencies: Beyond the Worst Case},
  author={Edith Cohen and Ofir Geri and Rasmus Pagh},
  booktitle={ICML},
  year={2020}
}
Recently there has been increased interest in using machine learning techniques to improve classical algorithms. In this paper we study when it is possible to construct compact, composable sketches for weighted sampling and statistics estimation according to functions of data frequencies. Such structures are now central components of large-scale data analytics and machine learning pipelines. However, many common functions, such as thresholds and $p$th frequency moments with $p>2$, are known to… 
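To make the objects in the abstract concrete, the sketch below computes statistics of the form Σ_x f(frequency of x) exactly (including the thresholds and p-th frequency moments mentioned above) and shows what "composable" means: summaries built from two streams separately can be merged into the summary of their concatenation. This is only an exact, non-compact illustration for intuition, not the paper's construction; all names are illustrative.

```python
from collections import Counter

def frequency_statistic(stream, f):
    """Exact value of sum over distinct keys x of f(frequency of x)."""
    counts = Counter(stream)
    return sum(f(c) for c in counts.values())

stream = ["a", "b", "a", "c", "a", "b"]
p = 3
moment_p = frequency_statistic(stream, lambda c: c ** p)                 # p-th frequency moment
threshold_2 = frequency_statistic(stream, lambda c: 1 if c >= 2 else 0)  # keys with frequency >= 2

# "Composable" means a summary of stream1 followed by stream2 can be obtained
# by merging summaries built independently for each part. The exact (and
# therefore non-compact) summary below trivially has this property; the paper
# asks when compact sketches with the same property exist for a given f.
def summarize(stream):
    return Counter(stream)

def merge(summary1, summary2):
    return summary1 + summary2

assert merge(summarize(["a", "b"]), summarize(["a"])) == summarize(["a", "b", "a"])
```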

Citations

Persistent Summaries
TLDR
This paper aims at designing persistent summaries, thereby giving streaming algorithms the ability to answer queries about the stream at any prior time.
Faster Fundamental Graph Algorithms via Learned Predictions
TLDR
A set of general learnability theorems is given, showing that the predictions required by the algorithms can be efficiently learned in a PAC fashion, leading to new algorithms for degree-constrained subgraph and minimum-cost 0-1 flow, based on reductions to bipartite matching and the shortest path problem.
Triangle and Four Cycle Counting with Predictions in Graph Streams
TLDR
The power of a “heavy edge” oracle in multiple graph edge streaming models is explored and a one-pass triangle counting algorithm improving upon the previous space upper bounds without such an oracle is presented.
Few-Shot Data-Driven Algorithms for Low Rank Approximation
TLDR
These algorithms are interpretable: while previous algorithms choose the sketching matrix either at random or by black-box learning, this work shows that it can be set to clearly interpretable values extracted from the dataset.
Faster Matchings via Learned Duals
TLDR
A rigorous, practical, and empirically effective method to compute bipartite matchings, taking a first step in this direction by combining machine-learned predictions with “warm-starting” primal-dual algorithms.
Learning Online Algorithms with Distributional Advice
TLDR
For the broad class of log-concave distributions, it is shown that poly(1/ε) samples suffice to obtain a (1 + ε)-competitive ratio, and the sample upper bound is close to best possible, even for very simple classes of distributions.
Non-Clairvoyant Scheduling with Predictions
TLDR
This work revisits the single-machine non-clairvoyant scheduling problem and proposes a new measure to gauge prediction quality and design scheduling algorithms with strong guarantees under this measure based on natural desiderata.
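As a toy illustration of why predictions help in non-clairvoyant scheduling (this is not the algorithm or the prediction-quality measure proposed in the cited paper), the snippet below compares the total completion time of processor sharing, which uses no size information, with a run-in-order-of-predicted-size policy; the job sizes and predictions are made-up examples.

```python
def total_completion_processor_sharing(sizes):
    """Round-robin / processor sharing with no information: with all jobs
    released at time 0, job i finishes at sum_j min(p_j, p_i)."""
    return sum(sum(min(pj, pi) for pj in sizes) for pi in sizes)

def total_completion_with_predictions(sizes, predictions):
    """Run jobs in order of predicted size (a simple prediction-augmented
    policy, not the cited paper's algorithm): completion times are prefix
    sums of the true sizes in that order."""
    order = sorted(range(len(sizes)), key=lambda i: predictions[i])
    total, clock = 0, 0
    for i in order:
        clock += sizes[i]
        total += clock
    return total

sizes = [5, 1, 3, 8, 2]
good_predictions = [5.2, 0.9, 3.5, 7.0, 2.2]   # roughly correct ordering
print(total_completion_processor_sharing(sizes))
print(total_completion_with_predictions(sizes, good_predictions))
```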
Differentially Private Weighted Sampling
TLDR
PWS maximizes the reporting probabilities of keys and also improves over the state of the art for the well-studied special case of private histograms, where no sampling is performed.
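For the private-histogram special case mentioned in the summary, a minimal baseline (the classical Laplace-noise-plus-threshold construction, not the paper's PWS scheme) looks roughly as follows; epsilon, the threshold, and the assumption that each user contributes to a single key are illustrative choices.

```python
import random

def private_histogram(counts, epsilon, threshold, rng=random):
    """Classical Laplace-mechanism private histogram: add Laplace(1/epsilon)
    noise to each count and report only keys whose noisy count clears a
    threshold (the thresholding is what handles a data-dependent key set).
    Assumes each user contributes to a single key, so the sensitivity is 1."""
    noisy = {}
    for key, c in counts.items():
        # Difference of two Exp(epsilon) variates is Laplace with scale 1/epsilon.
        noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
        if c + noise >= threshold:
            noisy[key] = c + noise
    return noisy
```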
WOR and p's: Sketches for ℓp-Sampling Without Replacement
TLDR
This work designs novel composable sketches for WOR ℓp sampling, that is, weighted sampling of keys without replacement according to a power p ∈ [0, 2] of their frequency (or, for signed data, of the sum of updates), whose size grows only linearly with the sample size.
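The classical building block behind without-replacement weighted sampling is order sampling with exponential ranks (PPSWOR): give key x the rank Exp(1)/w(x) and keep the k smallest ranks. A minimal offline version for weights freq^p is sketched below; it assumes frequencies are already aggregated, whereas the cited paper's point is to realize this from small composable sketches over streams of updates.

```python
import random
from collections import Counter

def wor_sample(frequencies, p, k, seed=0):
    """Weighted sampling WITHOUT replacement of k keys with weight freq**p,
    via exponential ranks (order sampling / PPSWOR): draw rank
    r(x) = Exp(1) / weight(x) and keep the k smallest ranks.
    Offline illustration over aggregated frequencies, not a streaming sketch."""
    rng = random.Random(seed)
    ranked = [(rng.expovariate(1.0) / (freq ** p), key)
              for key, freq in frequencies.items()]
    ranked.sort()
    return [key for _, key in ranked[:k]]

freqs = Counter("aaaaabbbccd")       # frequencies: a=5, b=3, c=2, d=1
print(wor_sample(freqs, p=1.0, k=2))
```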

References

SHOWING 1-10 OF 56 REFERENCES
Mergeable summaries
TLDR
This article demonstrates that heavy-hitter and quantile summaries are indeed mergeable, or can be made mergeable after appropriate modifications; it provides the best known randomized streaming bound for ε-approximate quantiles that depends only on ε, of size O((1/ε) log^{3/2}(1/ε)), and shows that the MG and SpaceSaving summaries for heavy hitters are isomorphic.
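A concrete instance of a mergeable heavy-hitters summary is the MG (Misra-Gries) structure referenced above; a minimal sketch of its update rule and of the merge described in that article (add counters, subtract the (k+1)-st largest value, keep positive remainders) follows. Class and parameter names are illustrative.

```python
class MisraGries:
    """Misra-Gries heavy-hitters summary with at most k counters.
    Any key's frequency is underestimated by at most n/(k+1)."""

    def __init__(self, k):
        self.k = k
        self.counters = {}

    def update(self, key):
        if key in self.counters:
            self.counters[key] += 1
        elif len(self.counters) < self.k:
            self.counters[key] = 1
        else:
            # Decrement all counters; drop the ones that reach zero.
            for other in list(self.counters):
                self.counters[other] -= 1
                if self.counters[other] == 0:
                    del self.counters[other]

    def merge(self, other):
        """Mergeable-summaries merge: add counters, then subtract the
        (k+1)-st largest value and keep only positive remainders."""
        merged = dict(self.counters)
        for key, c in other.counters.items():
            merged[key] = merged.get(key, 0) + c
        if len(merged) > self.k:
            threshold = sorted(merged.values(), reverse=True)[self.k]
            merged = {key: c - threshold for key, c in merged.items() if c > threshold}
        result = MisraGries(self.k)
        result.counters = merged
        return result
```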
Learning-Based Frequency Estimation Algorithms
TLDR
This work proposes a new class of algorithms that automatically learn relevant patterns in the input data and use them to improve their frequency estimates, and proves that these learning-based algorithms have lower estimation errors than their non-learning counterparts.
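The basic recipe in learning-based frequency estimation is to route keys a learned oracle predicts to be heavy into exact dedicated counters, while the remaining keys share a standard Count-Min sketch, so heavy keys no longer pollute the shared buckets. The sketch below follows that recipe in spirit; Python's built-in hash stands in for the pairwise-independent hash functions a real implementation would use, and all names are illustrative.

```python
import random

class CountMin:
    """Standard Count-Min sketch: estimates never underestimate frequencies."""
    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.salts = [rng.randrange(1 << 30) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, key):
        # hash((salt, key)) stands in for independent pairwise-independent hashes.
        return [(r, hash((salt, key)) % self.width) for r, salt in enumerate(self.salts)]

    def update(self, key, delta=1):
        for r, c in self._cells(key):
            self.table[r][c] += delta

    def estimate(self, key):
        return min(self.table[r][c] for r, c in self._cells(key))

class LearnedCountMin:
    """Learning-augmented frequency estimation: predicted-heavy keys get exact
    dedicated counters, the rest share a Count-Min sketch."""
    def __init__(self, predicted_heavy, width, depth):
        self.exact = {key: 0 for key in predicted_heavy}
        self.sketch = CountMin(width, depth)

    def update(self, key, delta=1):
        if key in self.exact:
            self.exact[key] += delta
        else:
            self.sketch.update(key, delta)

    def estimate(self, key):
        return self.exact[key] if key in self.exact else self.sketch.estimate(key)
```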
Perfect Lp Sampling in a Data Stream
TLDR
This paper shows that ν need not factor into the space of an L_p sampler, which completely closes the complexity of the problem for this range of p, and shows that a (1 ± ε) relative-error estimate of the frequency f_i of the sampled index i can be obtained using an additional O(ε^{-p} log n) bits of space for p < 2, and O(ε^{-2} log^2 n) bits for p = 2.
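For reference, an ℓp sampler returns index i with probability proportional to |f_i|^p. The toy function below samples from exactly that distribution offline, given the full frequency vector; the difficulty addressed by the cited paper is achieving this (perfectly) from a small linear sketch of a stream of updates, which this illustration does not attempt.

```python
import random

def lp_sample(freqs, p, rng=random):
    """Offline illustration of the lp-sampling distribution:
    index i is returned with probability |f_i|**p / sum_j |f_j|**p."""
    weights = [abs(f) ** p for f in freqs]
    return rng.choices(range(len(freqs)), weights=weights, k=1)[0]

print(lp_sample([3, -1, 0, 5], p=2))
```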
The space complexity of approximating the frequency moments
TLDR
It turns out that the numbers F_0, F_1, and F_2 can be approximated in logarithmic space, whereas the approximation of F_k for k ≥ 6 requires n^{Ω(1)} space.
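The classic positive result in this reference is the AMS ("tug-of-war") sketch for F_2: maintain z_j = Σ_i s_j(i) f_i for random ±1 functions s_j and average the z_j². A minimal version is sketched below; Python's hash replaces the 4-wise independent hash functions the analysis requires, and a plain mean replaces the median-of-means used for high-probability bounds.

```python
import random

class AMSF2:
    """Tug-of-war (AMS) sketch for the second frequency moment F2 = sum_i f_i^2.
    Each counter maintains z_j = sum_i s_j(i) * f_i for a random +/-1 hash s_j;
    each z_j^2 is an unbiased estimate of F2."""
    def __init__(self, reps=100, seed=0):
        rng = random.Random(seed)
        self.salts = [rng.randrange(1 << 30) for _ in range(reps)]
        self.z = [0] * reps

    def _sign(self, salt, key):
        return 1 if hash((salt, key)) & 1 else -1

    def update(self, key, delta=1):
        for j, salt in enumerate(self.salts):
            self.z[j] += self._sign(salt, key) * delta

    def estimate(self):
        return sum(zj * zj for zj in self.z) / len(self.z)  # mean of unbiased estimates

    def merge(self, other):
        """Sketches built with the same salts compose: counters simply add."""
        assert self.salts == other.salts
        merged = AMSF2(reps=len(self.z))
        merged.salts = self.salts
        merged.z = [a + b for a, b in zip(self.z, other.z)]
        return merged
```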
Asymptotic theory for order sampling
A Tight Lower Bound for High Frequency Moment Estimation with Small Error
TLDR
This lower bound matches the space complexity of an upper bound of Ganguly for any p > 2 and ε ≥ 1/n^{1/p}, and is optimal for ε < 1/log^{O(1)} n.
Better Algorithms for Counting Triangles in Data Streams
TLDR
To do this, the first algorithm for ℓ_p sampling such that multiple independent samples can be generated with O(polylog n) update time is developed; this primitive is widely applicable and may be of independent interest.
Stable distributions
We give many explicit formulas for stable distributions, mainly based on Feller [3] and Zolotarev [14] and using several parametrizations; we also give some explicit calculations for convergence to …
Stable distributions, pseudorandom generators, embeddings, and data stream computation
TLDR
The aforementioned sketching approach directly translates into an approximate algorithm that solves the main open problem of Feigenbaum et al.
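The stable-distribution technique from this reference, in its simplest form (p = 1), projects the frequency vector onto Cauchy random vectors and takes the median of absolute values as an ℓ1 estimate. The sketch below derives the Cauchy variates pseudorandomly from the key, a simple stand-in for the pseudorandom-generator machinery the cited paper actually uses; repetition count and seeding are illustrative.

```python
import math
import random

class CauchyL1Sketch:
    """Indyk-style linear sketch for the l1 norm using 1-stable (Cauchy) projections:
    y_j = sum_i c_j(i) * f_i with c_j(i) ~ Cauchy, so each y_j is distributed as
    ||f||_1 times a standard Cauchy, and median_j |y_j| estimates ||f||_1
    (the median of |standard Cauchy| is 1)."""
    def __init__(self, reps=101, seed=0):
        self.reps = reps
        self.seed = seed
        self.y = [0.0] * reps

    def _cauchy(self, j, key):
        # Pseudorandom Cauchy variate tied to (repetition, key).
        rng = random.Random(hash((self.seed, j, key)))
        u = rng.random()
        return math.tan(math.pi * (u - 0.5))

    def update(self, key, delta=1.0):
        for j in range(self.reps):
            self.y[j] += self._cauchy(j, key) * delta

    def estimate(self):
        abs_y = sorted(abs(v) for v in self.y)
        return abs_y[len(abs_y) // 2]   # median of |y_j|
```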
Finding Frequent Items in Data Streams
TLDR
This work presents a 1-pass algorithm for estimating the most frequent items in a data stream using limited storage space, which achieves better space bounds than the previously known best algorithms for this problem for several natural distributions on the item frequencies.
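The 1-pass frequent-items algorithm in this reference is built on what is now called the CountSketch data structure; a minimal version is below, with Python's hash standing in for the pairwise-independent bucket and sign hashes of the original analysis.

```python
import random

class CountSketch:
    """CountSketch: each row hashes a key to a bucket and a +/-1 sign; the
    median over rows of sign * bucket is an unbiased frequency estimate whose
    error scales with the residual l2 norm, which is what makes it effective
    at finding frequent items under skewed frequency distributions."""
    def __init__(self, width=256, depth=5, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.salts = [rng.randrange(1 << 30) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _bucket_and_sign(self, salt, key):
        h = hash((salt, key))
        return (h >> 1) % self.width, (1 if h & 1 else -1)

    def update(self, key, delta=1):
        for row, salt in enumerate(self.salts):
            b, s = self._bucket_and_sign(salt, key)
            self.table[row][b] += s * delta

    def estimate(self, key):
        ests = sorted(s * self.table[row][b]
                      for row, salt in enumerate(self.salts)
                      for b, s in [self._bucket_and_sign(salt, key)])
        return ests[len(ests) // 2]   # median over rows
```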