Corpus ID: 235446639

Learning-based Support Estimation in Sublinear Time

T. Eden, P. Indyk, Shyam Narayanan, R. Rubinfeld, Sandeep Silwal, Tal Wagner
We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems, and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to ±εn from a sample of size O(log(1/ε) · n/log n), where n is the data set size…
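As a concrete illustration of the problem, here is a minimal sketch of the classic Chao1 estimator, a standard baseline for estimating support size from a sample. It is not the algorithm from this paper; it simply extrapolates the number of unseen elements from the counts of singletons and doubletons in the sample.

```python
from collections import Counter

def chao1_support_estimate(sample):
    """Chao1 lower-bound estimate of the support size.

    Extrapolates the number of unseen elements from f1 (elements
    appearing exactly once) and f2 (exactly twice). A textbook
    baseline, NOT the sublinear-sample algorithm of the paper above.
    """
    freqs = Counter(sample)
    observed = len(freqs)
    f1 = sum(1 for c in freqs.values() if c == 1)
    f2 = sum(1 for c in freqs.values() if c == 2)
    if f2 == 0:
        # bias-corrected variant used when no doubletons are observed
        return observed + f1 * (f1 - 1) / 2
    return observed + f1 * f1 / (2 * f2)
```

For example, on the sample `a, a, b, b, c` the estimator returns 3.25: three observed elements plus a small correction for the single singleton.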


Putting the "Learning" into Learning-Augmented Algorithms for Frequency Estimation
It is shown that machine learning models that are trained to optimize for coverage lead to large improvements in performance over prior approaches according to the average absolute frequency error.


Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs
We introduce a new approach to characterizing the unobserved portion of a distribution, which provides sublinear-sample estimators achieving arbitrarily small additive constant error for a class of …
Estimating the Unseen: Improved Estimators for Entropy and other Properties
This work proposes a novel modification of the Good-Turing frequency estimation scheme, which seeks to estimate the shape of the unobserved portion of the distribution, and is robust, general, and theoretically principled; it is expected that it may be fruitfully used as a component within larger machine learning and data analysis systems.
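For reference, the classical Good-Turing scheme that this paper refines estimates the total probability mass of unseen elements as the fraction of the sample consisting of elements seen exactly once. A minimal sketch (the paper's modified estimator is not reproduced here):

```python
from collections import Counter

def good_turing_missing_mass(sample):
    """Good-Turing estimate of the probability mass of unseen elements:
    f1 / n, where f1 is the number of elements appearing exactly once.
    The classical building block, not the paper's refined variant."""
    n = len(sample)
    f1 = sum(1 for c in Counter(sample).values() if c == 1)
    return f1 / n
```

On the sample `a, a, b, c`, two of the four draws are singletons, so the estimated unseen mass is 0.5.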
Learning-Augmented Data Stream Algorithms
The full power of an oracle trained to predict item frequencies in the streaming model is explored, showing that it can be applied to a wide array of problems in data streams, sometimes resulting in the first optimal bounds for such problems.
Optimal prediction of the number of unseen species
A class of simple algorithms is obtained that provably predict U all the way up to t ∝ log n samples, and it is shown that this range is the best possible and that the estimator's mean-square error is near optimal for any t.
Probability-Revealing Samples
This work introduces a model in which every sample comes with the information about the probability of selecting it, and gives algorithms for problems such as testing if two distributions are (approximately) identical, estimating the total variation distance between distributions, and estimating the support size.
An optimal algorithm for the distinct elements problem
The first optimal algorithm for estimating the number of distinct elements in a data stream is given, closing a long line of theoretical research on this problem, and has optimal O(1) update and reporting times.
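A simple way to see how sublinear-space distinct-element estimation works is the K-Minimum-Values (KMV) sketch, a well-known baseline (not the optimal algorithm of the paper above): hash each item into [0, 1], keep the k smallest distinct hash values, and estimate the count from the k-th smallest.

```python
import hashlib
import heapq

def kmv_distinct_estimate(stream, k=64):
    """K-Minimum-Values sketch for counting distinct elements.

    Keeps the k smallest distinct normalized hash values seen; if the
    k-th smallest is h_k, the distinct count is estimated as (k-1)/h_k.
    A standard sublinear-space baseline, not the optimal algorithm
    from the paper above.
    """
    heap = []        # max-heap (via negation) of the k smallest hashes
    in_heap = set()  # hash values currently kept, to skip duplicates
    for item in stream:
        h = int(hashlib.md5(str(item).encode()).hexdigest(), 16) / float(2**128)
        if h in in_heap:
            continue
        if len(heap) < k:
            heapq.heappush(heap, -h)
            in_heap.add(h)
        elif h < -heap[0]:
            in_heap.discard(-heapq.heappop(heap))
            heapq.heappush(heap, -h)
            in_heap.add(h)
    if len(heap) < k:
        return len(heap)   # fewer than k distinct items: count is exact
    return (k - 1) / (-heap[0])
```

With k = 64 the relative error is roughly 1/√k ≈ 12%; larger k trades space for accuracy.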
Spreading vectors for similarity search
This work designs and trains a neural net whose last layer forms a fixed parameter-free quantizer, such as pre-defined points of a hypersphere, and proposes a new regularizer derived from the Kozachenko–Leonenko differential entropy estimator to enforce uniformity, combining it with a locality-aware triplet loss.
Learning-Based Frequency Estimation Algorithms
This work proposes a new class of algorithms that automatically learn relevant patterns in the input data and use them to improve their frequency estimates, and proves that these learning-based algorithms have lower estimation errors than their non-learning counterparts.
Learning to Branch
It is shown how to use machine learning to determine an optimal weighting of any set of partitioning procedures for the instance distribution at hand using samples from the distribution, and it is proved that this reduction can even be exponential.
Chebyshev polynomials, moment matching, and optimal estimation of the unseen
We consider the problem of estimating the support size of a discrete distribution whose minimum non-zero mass is at least $\frac{1}{k}$. Under the independent sampling model, we show that the sample …