Relative Error Streaming Quantiles

@inproceedings{Cormode2021RelativeES,
  title={Relative Error Streaming Quantiles},
  author={Graham Cormode and Zohar S. Karnin and Edo Liberty and Justin Thaler and Pavel Vesel{\'y}},
  booktitle={Proceedings of the 40th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems},
  year={2021}
}
Approximating ranks, quantiles, and distributions over streaming data is a central task in data analysis and monitoring. Given a stream of n items from a data universe U equipped with a total order, the task is to compute a sketch (data structure) of size poly(log n, 1/ε). Given the sketch and a query item y ∈ U, one should be able to approximate its rank in the stream, i.e., the number of stream elements smaller than or equal to y. Most works to date focused on additive εn error… 
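As a concrete (and intentionally naive) illustration of the difference between the additive εn guarantee most prior work targets and the relative ε·rank(y) guarantee pursued here, the following Python snippet contrasts an exact rank query with an estimate derived from a uniform random sample. It is a sketch for intuition only, not the paper's algorithm; the stream, sample size, and query item are made up.

import bisect
import random

def exact_rank(sorted_items, y):
    """rank(y) = number of stream elements <= y."""
    return bisect.bisect_right(sorted_items, y)

def sampled_rank(sorted_sample, n, y):
    """Estimate rank(y) from a uniform random sample of a stream of length n.
    The estimate is accurate only up to an additive error on the order of eps * n,
    the guarantee targeted by most additive-error sketches."""
    return round(exact_rank(sorted_sample, y) * n / len(sorted_sample))

random.seed(0)
stream = [random.random() for _ in range(100_000)]
full = sorted(stream)
sample = sorted(random.sample(stream, 1_000))

y = full[49]                                  # item deep in the lower tail; true rank is 50
print(exact_rank(full, y))                    # 50
print(sampled_rank(sample, len(stream), y))   # additive estimate; may be off by hundreds,
                                              # i.e., a huge *relative* error for small ranks

For items near the tails (small or large ranks), an additive εn error can dwarf the true rank, which is exactly the regime where a relative-error sketch remains accurate.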


Bounded Space Differentially Private Quantiles
TLDR
This work devises a differentially private algorithm for the quantile estimation problem with strongly sublinear space complexity, in both the one-shot and continual observation settings, and presents another algorithm based on histograms that is especially suited to the multiple-quantiles case.
Theory meets Practice: worst case behavior of quantile algorithms
TLDR
This work shows how to construct inputs for t-digest that induce an almost arbitrarily large error and demonstrates that it fails to provide accurate results even on i.i.d. samples from a highly nonuniform distribution, and proposes practical improvements to ReqSketch, making it faster than t-Digest, while its error stays bounded on any instance.
Theory meets Practice at the Median: A Worst Case Comparison of Relative Error Quantile Algorithms
TLDR
This work shows how to construct inputs for t-digest that induce an almost arbitrarily large error and demonstrates that it fails to provide accurate results even on i.i.d. samples from a highly non-uniform distribution, and proposes practical improvements to ReqSketch, making it faster than t-Digest, while its error stays bounded on any instance.
SQUAD: Combining Sketching and Sampling Is Better than Either for Per-item Quantile Estimation
TLDR
This work designs an algorithm that augments a quantile sketch within each entry of a heavy hitter algorithm, resulting in similar space complexity but with a deterministic error guarantee, and presents SQUAD, a method that combines sampling and sketching while improving the asymptotic space complexity.
Asymmetric scale functions for t-digests
  • Joseph Ross
  • Mathematics
    Journal of Statistical Computation and Simulation
  • 2021
TLDR
A t-digest variant with accuracy asymmetric about the median is developed, thereby making possible alternative trade-offs between computational resources and accuracy which may be of particular interest for distributions with significant skew.
Amazon SageMaker Model Monitor: A System for Real-Time Insights into Deployed Machine Learning Models
TLDR
Amazon SageMaker Model Monitor is presented, a fully managed service that continuously monitors the quality of machine learning models hosted on Amazon SageMaker, automatically detects data, concept, bias, and feature attribution drift in models in real time, and provides alerts so that model owners can take corrective action and maintain high-quality models.
Current Trends in Data Summaries
TLDR
In this column, recent developments in data summarization are surveyed, with the intent of inspiring further advances.
Relative Error Streaming Quantiles
TLDR
This paper presents a new approach to estimating ranks, quantiles, and distributions over streaming data by computing a sketch of size polylogarithmic in n from a data universe equipped with a total order.
A Human-Centric Take on Model Monitoring
TLDR
This work identifies the need, and the challenge, for model monitoring systems to clarify the impact of monitoring observations on outcomes, and finds that such insights must be actionable, robust, customizable for domain-specific use cases, and cognitively considerate to avoid information overload.
Technical Perspective
  • R. Pagh
  • Computer Science
    SIGMOD Rec.
  • 2022
TLDR
Solutions to this problem have numerous applications in large-scale data analysis and can potentially be used for range query selectivity estimation in database engines.
...
...

References

SHOWING 1-10 OF 36 REFERENCES
Optimal Quantile Approximation in Streams
TLDR
This paper resolves one of the longest-standing basic problems in the streaming computational model and proves a qualitative gap between randomized and deterministic quantile sketching, for which an Ω((1/ε) log log(1/δ)) lower bound is known.
A Tight Lower Bound for Comparison-Based Quantile Summaries
TLDR
This paper focuses on comparison-based quantile summaries that can only compare two items and are otherwise completely oblivious of the universe, and improves the lower bound for biased quantiles, which provide a stronger, relative-error guarantee of (1 ± ε)·φ, and for other related computational tasks.
DDSketch: A Fast and Fully-Mergeable Quantile Sketch with Relative-Error Guarantees
TLDR
This work presents the first fully-mergeable, relative-error quantile sketching algorithm with formal guarantees, which is extremely fast and accurate, and is currently used by Datadog at wide scale.
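For intuition about how a relative-error guarantee can follow from geometric bucketing, the toy Python class below maps each positive value to a bucket whose index grows logarithmically, in the spirit of DDSketch; it is not the library's actual API or implementation (bucket collapsing, negative values, and merging are omitted), and the class and parameter names are illustrative.

import math
from collections import defaultdict

class LogBucketSketch:
    """Toy relative-error quantile sketch: every value falling in a bucket
    differs from the bucket's representative by at most a factor of (1 + alpha)."""

    def __init__(self, alpha=0.01):
        self.alpha = alpha
        self.gamma = (1 + alpha) / (1 - alpha)  # geometric bucket growth factor
        self.counts = defaultdict(int)          # bucket index -> item count
        self.n = 0

    def add(self, x):
        assert x > 0, "this toy sketch handles positive values only"
        i = math.ceil(math.log(x, self.gamma))  # gamma**(i-1) < x <= gamma**i
        self.counts[i] += 1
        self.n += 1

    def quantile(self, q):
        """Return a value within relative error alpha of the q-quantile."""
        target = max(1, math.ceil(q * self.n))
        seen = 0
        for i in sorted(self.counts):
            seen += self.counts[i]
            if seen >= target:
                # representative chosen to equalize relative error across the bucket
                return 2 * self.gamma ** i / (self.gamma + 1)
        return None

Because bucket boundaries grow geometrically, covering values in a range [1, M] takes only on the order of log(M)/alpha buckets, which is where the space savings of relative-error bucketing come from.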
Space- and time-efficient deterministic algorithms for biased quantiles over data streams
TLDR
This work presents the first deterministic algorithms for answering biased quantile queries accurately with small space and time bounds (sublinear in the input size) in one pass, and shows that they use less space than existing methods in many practical settings while remaining fast to maintain.
Space-efficient online computation of quantile summaries
TLDR
The actual space bounds obtained on experimental data are significantly better than the worst case guarantees of the algorithm as well as the observed space requirements of earlier algorithms.
Effective computation of biased quantiles over data streams
TLDR
This paper formalizes them as the "high-biased" and the "targeted" quantiles problems, respectively, and presents algorithms with provable guarantees, that perform significantly better than previously known solutions for these problems.
A Randomized Online Quantile Summary in O((1/ε) log(1/ε)) Words
TLDR
This paper develops a randomized online quantile summary for the cash register data input model and the comparison data domain model that uses O((1/ε) log(1/ε)) words of memory, improving upon the previous best upper bound.
An efficient algorithm for approximate biased quantile computation in data streams
TLDR
This work proposes an efficient algorithm that dynamically maintains the biased quantile summary for the entire stream as the exponential histogram over the block-wise quantile summaries in large data streams.
Quantiles over data streams: experimental comparisons, new analyses, and further improvements
TLDR
This paper provides a taxonomy of different methods, proposes new variants that have not been studied before yet outperform existing methods, and describes efficient implementations of these methods.
Random sampling techniques for space efficient online computation of order statistics of large datasets
TLDR
A novel non-uniform random sampling scheme and an extension of this framework are presented, which form the basis of a new algorithm that computes approximate quantiles without knowing the input sequence length.
...
...