Corpus ID: 235446639

Learning-based Support Estimation in Sublinear Time

Talya Eden, Piotr Indyk, Shyam Narayanan, Ronitt Rubinfeld, Sandeep Silwal, Tal Wagner
We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to ±εn from a sample of size O(log(1/ε) · n/log n), where n is the data set size… 
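The abstract stops short of describing the algorithm itself. For context, a classical non-learning baseline for estimating support size from a sample is the Chao1 estimator, which extrapolates the unseen count from the number of elements observed exactly once and exactly twice. A minimal sketch of that baseline (this is a standard comparison point, not the paper's learning-based method):

```python
from collections import Counter

def chao1_support_estimate(sample):
    """Bias-corrected Chao1 estimate of support size from a sample.

    d  = number of distinct elements observed,
    f1 = elements seen exactly once, f2 = elements seen exactly twice.
    The unseen count is extrapolated as f1*(f1-1) / (2*(f2+1)).
    """
    freq = Counter(sample)
    d = len(freq)
    counts = Counter(freq.values())
    f1, f2 = counts[1], counts[2]
    return d + f1 * (f1 - 1) / (2 * (f2 + 1))

# Example: 8 distinct elements observed, 6 singletons, 2 doubletons.
sample = [1, 1, 2, 3, 4, 5, 5, 6, 7, 8]
print(chao1_support_estimate(sample))  # -> 13.0
```

Estimators in the ±εn line of work cited here achieve far stronger guarantees, but the same ingredients (the fingerprint f1, f2, …) drive them as well.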


Bias Reduction for Sum Estimation

In classical statistics and distribution testing, it is often assumed that elements can be sampled exactly from some distribution P, and that when an element x is sampled, the probability P(x) of…

Putting the "Learning" into Learning-Augmented Algorithms for Frequency Estimation

It is shown that machine learning models trained to optimize for coverage lead to large improvements in performance over prior approaches, as measured by average absolute frequency error.

Improved Learning-augmented Algorithms for k-means and k-medians Clustering

This work proposes a deterministic k-means algorithm that produces centers with an improved bound on clustering cost compared to the previous randomized algorithm, while preserving the O(dm log m) runtime.

Learning-Augmented Maximum Flow

An algorithm is presented showing that, given oracle access to a distribution over flow networks, it is possible to efficiently PAC-learn a prediction minimizing the expected ℓ1 error over that distribution.

Few-Shot Data-Driven Algorithms for Low Rank Approximation

These algorithms are interpretable: while previous algorithms choose the sketching matrix either at random or by black-box learning, this work shows that it can be set to clearly interpretable values extracted from the dataset.


The power of a “heavy edge” oracle in multiple graph edge streaming models is explored and a one-pass triangle counting algorithm improving upon the previous space upper bounds without such an oracle is presented.

Learning-Augmented Algorithms for Online Linear and Semidefinite Programming

This paper studies online covering linear and semidefinite programs in which the algorithm is augmented with advice from a possibly erroneous predictor, and introduces a framework that extends both the online set cover problem augmented with machine-learning predictors and the online covering SDP problem, initiated by Elad, Kale, and Naor.

Daisy Bloom Filters

A near-optimal choice of the parameters k_x is determined in a model where n elements are inserted independently from a probability distribution P and query elements are chosen from a probability distribution Q, under a bound on the false positive probability F.

(Optimal) Online Bipartite Matching with Predicted Degrees

This work proposes a model for online graph problems where algorithms are given access to an oracle that predicts (e.g., based on past data) the degrees of nodes in the graph, and studies a natural greedy matching algorithm called MinPredictedDegree, which uses predictions of the degrees of offline nodes.

Online Bipartite Matching with Predicted Degrees

This work proposes a model for online graph problems where algorithms are given access to an oracle that predicts the degrees of nodes in the graph (e.g., based on past data) and shows that a greedy algorithm called MinPredictedDegree compares favorably to state-of-the-art online algorithms for this problem.

Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs

We introduce a new approach to characterizing the unobserved portion of a distribution, which provides sublinear-sample estimators achieving arbitrarily small additive constant error for a class of…

Estimating the Unseen: Improved Estimators for Entropy and other Properties

This work proposes a novel modification of the Good-Turing frequency estimation scheme, which seeks to estimate the shape of the unobserved portion of the distribution. The resulting estimator is robust, general, and theoretically principled, and may be fruitfully used as a component within larger machine learning and data analysis systems.
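The Good-Turing scheme referenced above has a one-line core: the total probability mass of the unobserved portion of the distribution is estimated by the fraction of sample elements that appear exactly once. A minimal sketch of that core idea (the paper's modification is considerably more involved):

```python
from collections import Counter

def good_turing_unseen_mass(sample):
    """Good-Turing estimate of the total probability mass of unseen
    elements: the number of singletons divided by the sample size."""
    freq = Counter(sample)
    f1 = sum(1 for count in freq.values() if count == 1)
    return f1 / len(sample)

# Two of the four sample elements are singletons -> unseen mass ~ 0.5.
print(good_turing_unseen_mass(["a", "a", "b", "c"]))  # -> 0.5
```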

Learning-Augmented Data Stream Algorithms

The full power of an oracle trained to predict item frequencies in the streaming model is explored, showing that it can be applied to a wide array of problems in data streams, sometimes resulting in the first optimal bounds for such problems.

Optimal prediction of the number of unseen species

A class of simple algorithms is obtained that provably predicts U all the way up to t ∝ log n samples, and it is shown that this range is the best possible and that the estimator's mean-square error is near-optimal for any t.

Probability-Revealing Samples

This work introduces a model in which every sample comes with the information about the probability of selecting it, and gives algorithms for problems such as testing if two distributions are (approximately) identical, estimating the total variation distance between distributions, and estimating the support size.

An optimal algorithm for the distinct elements problem

The first optimal algorithm for estimating the number of distinct elements in a data stream is given, closing a long line of theoretical research on this problem, and has optimal O(1) update and reporting times.
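The optimal streaming algorithm in that paper is intricate, but the underlying idea is already visible in the much simpler k-minimum-values (KMV) sketch: hash each item into [0, 1), keep the k smallest hash values seen, and estimate the distinct count from the k-th smallest. A hedged sketch of KMV (a classical suboptimal baseline, not the algorithm of the cited paper):

```python
import hashlib
import heapq

def kmv_distinct_estimate(stream, k=64):
    """k-minimum-values sketch: with m_k the k-th smallest of the items'
    hash values in [0, 1), estimate the distinct count as (k - 1) / m_k."""
    heap = []            # max-heap (negated) of the k smallest hashes so far
    in_heap = set()      # hash values currently kept, for duplicate detection
    for item in stream:
        h = int.from_bytes(hashlib.sha1(str(item).encode()).digest()[:8], "big")
        x = h / 2**64    # deterministic, roughly uniform in [0, 1)
        if x in in_heap:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -x)
            in_heap.add(x)
        elif x < -heap[0]:
            evicted = -heapq.heappushpop(heap, -x)
            in_heap.discard(evicted)
            in_heap.add(x)
    if len(heap) < k:            # fewer than k distinct items: exact count
        return len(heap)
    return (k - 1) / -heap[0]    # -heap[0] is the k-th smallest hash value

# 1000 distinct items, each repeated 5 times; expect an estimate near 1000
# (relative error on the order of 1/sqrt(k)).
stream = [i % 1000 for i in range(5000)]
print(round(kmv_distinct_estimate(stream)))
```

KMV needs k ≈ 1/ε² samples for a (1 ± ε) estimate; the cited paper's contribution is matching the optimal O(ε⁻² + log n) bits of space with O(1) update and reporting times.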

Spreading vectors for similarity search

This work designs and trains a neural net whose last layer forms a fixed parameter-free quantizer, such as predefined points of a hypersphere, and proposes a new regularizer derived from the Kozachenko–Leonenko differential entropy estimator to enforce uniformity, combined with a locality-aware triplet loss.

Learning-Based Frequency Estimation Algorithms

This work proposes a new class of algorithms that automatically learn relevant patterns in the input data and use them to improve their frequency estimates, and proves that these learning-based algorithms have lower estimation errors than their non-learning counterparts.
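The learning-based frequency estimators in this line of work share a simple structure: a learned oracle routes predicted heavy hitters to exact per-item counters, and everything else to a small sketch such as Count-Min. A minimal sketch of that structure, with the oracle stubbed out as a plain predicate (a hypothetical stand-in; the papers train a model for this role):

```python
import random
from collections import defaultdict

class LearnedCountMin:
    """Count-Min sketch with a learned heavy-hitter oracle: items the
    oracle flags get exact counters; the rest share hashed counters."""

    def __init__(self, width, depth, is_heavy):
        self.is_heavy = is_heavy       # stand-in for a trained predictor
        self.exact = defaultdict(int)  # exact counts for predicted heavies
        self.width, self.depth = width, depth
        self.tables = [[0] * width for _ in range(depth)]
        self.seeds = [random.Random(r).randrange(2**31) for r in range(depth)]

    def _cells(self, item):
        return [(row, hash((self.seeds[row], item)) % self.width)
                for row in range(self.depth)]

    def update(self, item, count=1):
        if self.is_heavy(item):
            self.exact[item] += count
        else:
            for row, col in self._cells(item):
                self.tables[row][col] += count

    def estimate(self, item):
        if self.is_heavy(item):
            return self.exact[item]
        # Count-Min: minimum over rows, never an underestimate.
        return min(self.tables[row][col] for row, col in self._cells(item))

# Hypothetical oracle for illustration: flag items below 10 as heavy.
cm = LearnedCountMin(width=256, depth=4, is_heavy=lambda x: x < 10)
for i in range(1000):
    cm.update(i % 100)
print(cm.estimate(0), cm.estimate(50))  # 0 is exact (10); 50 is >= 10
```

Separating heavy items removes the largest sources of collision error from the sketch, which is where the provably lower estimation errors come from.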

Learning to Branch

It is shown how to use machine learning to determine an optimal weighting of any set of partitioning procedures for the instance distribution at hand using samples from the distribution, and it is proved that this reduction can even be exponential.

Chebyshev polynomials, moment matching, and optimal estimation of the unseen

The procedure is a linear estimator based on the Chebyshev polynomial and its approximation-theoretic properties, which can be evaluated in O(n + log² k) time and attains the sample complexity within a factor of six asymptotically.