# Learning-based Support Estimation in Sublinear Time

```bibtex
@article{Eden2021LearningbasedSE,
  title   = {Learning-based Support Estimation in Sublinear Time},
  author  = {T. Eden and P. Indyk and Shyam Narayanan and R. Rubinfeld and Sandeep Silwal and Tal Wagner},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2106.08396}
}
```

We consider the problem of estimating the number of distinct elements in a large data set (or, equivalently, the support size of the distribution induced by the data set) from a random sample of its elements. The problem occurs in many applications, including biology, genomics, computer systems and linguistics. A line of research spanning the last decade resulted in algorithms that estimate the support up to ±εn from a sample of size O(log(1/ε) · n/ log n), where n is the data set size…
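As a rough illustration of the estimation problem (not the paper's algorithm), the classic Chao1 estimator extrapolates the number of unseen elements from the counts of elements observed exactly once and exactly twice in the sample:

```python
from collections import Counter

def chao1_support_estimate(sample):
    """Chao1 lower-bound estimate of support size from a sample."""
    freqs = Counter(sample).values()
    s_obs = len(freqs)                    # distinct elements actually observed
    f1 = sum(1 for c in freqs if c == 1)  # elements seen exactly once
    f2 = sum(1 for c in freqs if c == 2)  # elements seen exactly twice
    if f2 == 0:
        return s_obs + f1 * (f1 - 1) / 2  # bias-corrected variant when f2 = 0
    return s_obs + f1 * f1 / (2 * f2)

# 5 distinct elements observed; 3 singletons and 2 doubletons suggest
# roughly 2.25 additional unseen elements.
print(chao1_support_estimate(["a", "a", "b", "c", "c", "d", "e"]))  # -> 7.25
```

Estimators of this family degrade once the sample is much smaller than the support, which is the regime the sublinear-sample line of work above targets.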

#### One Citation

Putting the "Learning" into Learning-Augmented Algorithms for Frequency Estimation

- Computer Science
- ICML
- 2021

It is shown that machine learning models which are trained to optimize for coverage lead to large improvements in performance over prior approaches according to the average absolute frequency error.

#### References

Showing 1–10 of 31 references

Estimating the unseen: an n/log(n)-sample estimator for entropy and support size, shown optimal via new CLTs

- Mathematics, Computer Science
- STOC '11
- 2011

We introduce a new approach to characterizing the unobserved portion of a distribution, which provides sublinear-sample estimators achieving arbitrarily small additive constant error for a class of…

Estimating the Unseen: Improved Estimators for Entropy and other Properties

- Computer Science, Mathematics
- NIPS
- 2013

This work proposes a novel modification of the Good-Turing frequency estimation scheme, which seeks to estimate the shape of the unobserved portion of the distribution, and is robust, general, and theoretically principled; it is expected that it may be fruitfully used as a component within larger machine learning and data analysis systems.
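The Good-Turing scheme referenced here has a one-line core: the total probability mass of elements never observed is estimated by the fraction of sample draws that are singletons. A minimal sketch of that core idea (not the paper's modified estimator):

```python
from collections import Counter

def good_turing_missing_mass(sample):
    # Good-Turing estimate of the total probability mass of elements
    # that never appeared in the sample: (# singletons) / (sample size).
    counts = Counter(sample)
    f1 = sum(1 for c in counts.values() if c == 1)
    return f1 / len(sample)

# 4 of the 10 draws are singletons, so about 40% of the underlying
# distribution's mass is estimated to belong to unseen elements.
print(good_turing_missing_mass([1, 1, 1, 2, 2, 3, 4, 5, 6, 2]))  # -> 0.4
```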

Learning-Augmented Data Stream Algorithms

- Computer Science
- ICLR
- 2020

The full power of an oracle trained to predict item frequencies in the streaming model is explored, showing that it can be applied to a wide array of problems in data streams, sometimes resulting in the first optimal bounds for such problems.

Optimal prediction of the number of unseen species

- Mathematics, Medicine
- Proceedings of the National Academy of Sciences
- 2016

A class of simple algorithms is obtained that provably predicts U all the way up to t ∝ log n samples, and it is shown that this range is the best possible and that the estimator's mean-square error is near optimal for any t.

Probability-Revealing Samples

- Computer Science
- AISTATS
- 2018

This work introduces a model in which every sample comes with the information about the probability of selecting it, and gives algorithms for problems such as testing if two distributions are (approximately) identical, estimating the total variation distance between distributions, and estimating the support size.

An optimal algorithm for the distinct elements problem

- Computer Science
- PODS '10
- 2010

The first optimal algorithm for estimating the number of distinct elements in a data stream is given, closing a long line of theoretical research on this problem; the algorithm has optimal O(1) update and reporting times.
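For intuition about hash-based distinct counting, here is a simple k-minimum-values (KMV) sketch: keep the k smallest normalized hash values seen; if the k-th smallest is h, roughly (k − 1)/h distinct items were hashed. This is an illustrative sketch with error about 1/√k, not the optimal algorithm from the reference:

```python
import hashlib

def kmv_distinct_estimate(stream, k=64):
    """KMV sketch: estimate the number of distinct items in a stream
    from the k smallest hash values, using O(k) memory."""
    smallest = set()
    for item in stream:
        digest = hashlib.sha256(str(item).encode()).digest()
        h = int.from_bytes(digest, "big") / 2**256  # hash mapped to [0, 1)
        smallest.add(h)
        if len(smallest) > k:
            smallest.remove(max(smallest))
    if len(smallest) < k:
        return float(len(smallest))  # fewer than k distinct hashes: exact count
    return (k - 1) / max(smallest)   # (k-1) / (k-th smallest hash)
```

With k = 256 the relative error is around 6%, while the sketch stores only 256 values regardless of stream length.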

Spreading vectors for similarity search

- Computer Science, Mathematics
- ICLR
- 2019

This work designs and trains a neural net whose last layer forms a fixed parameter-free quantizer, such as pre-defined points of a hypersphere, and proposes a new regularizer derived from the Kozachenko–Leonenko differential entropy estimator to enforce uniformity, combining it with a locality-aware triplet loss.

Learning-Based Frequency Estimation Algorithms

- Computer Science
- ICLR
- 2019

This work proposes a new class of algorithms that automatically learn relevant patterns in the input data and use them to improve their frequency estimates, and proves that these learning-based algorithms have lower estimation errors than their non-learning counterparts.
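A common way to realize this idea, sketched below under the assumption of a learned heavy-hitter oracle (the `oracle` callable is hypothetical), is to give oracle-flagged items exact counters while all remaining items share a standard Count-Min sketch; this is an illustration of the learning-augmented pattern, not the paper's exact construction:

```python
import random
from collections import defaultdict

class LearnedCountMin:
    """Count-Min sketch paired with a learned heavy-hitter oracle:
    items the oracle flags get exact counters, everything else is
    hashed into the shared sketch rows."""

    def __init__(self, width, depth, oracle):
        self.width, self.depth, self.oracle = width, depth, oracle
        self.exact = defaultdict(int)
        self.table = [[0] * width for _ in range(depth)]
        # one independent seed per sketch row
        self.seeds = [random.Random(row).getrandbits(32) for row in range(depth)]

    def _col(self, row, item):
        return hash((self.seeds[row], item)) % self.width

    def add(self, item):
        if self.oracle(item):
            self.exact[item] += 1
        else:
            for row in range(self.depth):
                self.table[row][self._col(row, item)] += 1

    def estimate(self, item):
        if self.oracle(item):
            return self.exact[item]  # exact count for predicted heavy hitters
        # Count-Min: take the minimum over rows; never underestimates
        return min(self.table[row][self._col(row, item)]
                   for row in range(self.depth))
```

Diverting predicted heavy hitters out of the sketch removes the largest sources of hash-collision error, which is where the provable improvement over the non-learning sketch comes from.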

Learning to Branch

- Computer Science
- ICML
- 2018

It is shown how to use machine learning to determine an optimal weighting of any set of partitioning procedures for the instance distribution at hand using samples from the distribution, and it is proved that this reduction can even be exponential.

Chebyshev polynomials, moment matching, and optimal estimation of the unseen

- Mathematics
- 2015

We consider the problem of estimating the support size of a discrete distribution whose minimum non-zero mass is at least $\frac{1}{k}$. Under the independent sampling model, we show that the sample…