Frugal Streaming for Estimating Quantiles

@inproceedings{Ma2013FrugalSF,
  title={Frugal Streaming for Estimating Quantiles},
  author={Qiang Ma and Sambavi Muthukrishnan and Mark Sandler},
  booktitle={Space-Efficient Data Structures, Streams, and Algorithms},
  year={2013}
}
Modern applications require processing streams of data for estimating statistical quantities such as quantiles with a small amount of memory. In many such applications, in fact, one needs to compute such statistical quantities for each of a large number of groups (e.g., network traffic grouped by source IP address), which additionally restricts the amount of memory available for the stream for any particular group. We address this challenge and introduce frugal streaming, that is, algorithms that…
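The abstract is cut off before it describes the algorithms, so the following is only a minimal sketch of a one-memory-unit quantile tracker in the spirit of frugal streaming, not the authors' implementation. It assumes integer-valued stream items and a unit step size; the function name `frugal_1u` and the parameters `q`, `start`, and `rng` are hypothetical names chosen for illustration.

```python
import random


def frugal_1u(stream, q, start=0, rng=None):
    """Track the q-quantile of `stream` using a single stored value.

    Sketch under stated assumptions: integer items, unit step size,
    and 0 < q < 1. All names here are illustrative, not from the paper.
    """
    rng = rng or random.Random()
    m = start  # the only state kept between items
    for x in stream:
        r = rng.random()
        if x > m and r > 1 - q:
            # Item is above the estimate: step up with probability q.
            m += 1
        elif x < m and r > q:
            # Item is below the estimate: step down with probability 1 - q.
            m -= 1
    return m


if __name__ == "__main__":
    source = random.Random(1)
    data = [source.randint(0, 1000) for _ in range(50_000)]
    print(frugal_1u(data, 0.5, rng=random.Random(2)))  # estimate of the median
    print(frugal_1u(data, 0.9, rng=random.Random(3)))  # estimate of the 90th percentile
```

The up and down moves balance when a fraction q of the items falls below the estimate, which is why the single stored value drifts toward the q-quantile. One such value per group is enough, which matches the per-group memory constraint the abstract describes.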
Tracking of multiple quantiles in dynamically varying data streams
TLDR
Experiments show that the method efficiently tracks multiple quantiles and outperforms state-of-the-art methods.
Estimation of Multiple Quantiles in Dynamically Varying Data Streams
TLDR
The method is memory and computationally efficient since it stores only one value for each quantile estimate and performs only one operation per quantile estimate when a new sample is received from the data stream.
Data Skepticism in Practice Online Anomaly Detection over Big Data Streams
TLDR
This thesis describes and empirically evaluates the design and implementation of a framework for data quality testing over real-world streams in a large-scale telecommunication network and proposes two measures for dynamically detecting anomalies: relative entropy for detecting changes in the users’ activity over time and Pearson correlation for detecting anomalies affecting individual data streams.
A Higher-Fidelity Frugal Quantile Estimator
TLDR
Comprehensive simulation results show that the present estimator outperforms the original Frugal algorithm in terms of accuracy. To the authors' knowledge, this is the first paper that proves the advantages of discretization within the domain of quantile estimation.
Online anomaly detection over Big Data streams
TLDR
A combination of the two metrics put forward can be applied to detect several types of anomalies - like infrastructure failures, hardware misconfiguration or user-driven anomalies - in large-scale telecommunication networks.
Joint tracking of multiple quantiles through conditional quantiles
Quantile Tracking in Dynamically Varying Data Streams Using a Generalized Exponentially Weighted Average of Observations
TLDR
This work presents a lightweight quantile estimator using a generalized form of the Exponentially Weighted Average that outperforms legacy state-of-the-art quantile tracking estimators and achieves faster adaptivity in dynamic environments.
Efficient quantile tracking using an oracle
TLDR
This paper suggests using expected quantile loss, a popular loss function in quantile regression, to monitor the quantile tracking error and thereby adapt efficiently to concept drift; the resulting tracking performance is shown to be close to theoretically optimal.
Optimal Quantile Approximation in Streams
TLDR
This paper resolves one of the longest standing basic problems in the streaming computational model and proves a qualitative gap between randomized and deterministic quantile sketching for which an Ω((1/ε)log log (1/δ)) lower bound is known.

References

SHOWING 1-10 OF 17 REFERENCES
Space- and time-efficient deterministic algorithms for biased quantiles over data streams
TLDR
This work presents the first deterministic algorithms for answering biased quantiles queries accurately with small—sublinear in the input size—space and time bounds in one pass, and shows it uses less space than existing methods in many practical settings, and is fast to maintain.
Approximate medians and other quantiles in one pass and with limited memory
TLDR
New algorithms for computing approximate quantiles of large datasets in a single pass are presented, and the main memory requirements are smaller than those reported by an order of magnitude.
A One-Pass Algorithm for Accurately Estimating Quantiles for Disk-Resident Data
TLDR
The experimental results show that the algorithm is indeed robust and does not depend on the distribution of the data sets, and extra time and memory for computing additional quantiles (beyond the first one) are constant per quantile.
Continuously maintaining quantile summaries of the most recent N elements over a data stream
TLDR
An algorithm that maintains quantile summaries for the most recent N elements so that quantile queries on any most recent n elements can be answered with a guaranteed precision of εn; the space requirement is much less than the given theoretical bound.
A One-Pass Space-Efficient Algorithm for Finding Quantiles
TLDR
An algorithm for finding the quantile values of a large unordered dataset with unknown distribution that requires only one pass over the data; the true quantile is guaranteed to lie within the lower and upper bounds produced by the algorithm.
Space-efficient online computation of quantile summaries
TLDR
The actual space bounds obtained on experimental data are significantly better than the worst case guarantees of the algorithm as well as the observed space requirements of earlier algorithms.
An improved data stream summary: the count-min sketch and its applications
TLDR
The Count-Min Sketch allows fundamental queries in data stream summarization such as point, range, and inner product queries to be approximately answered very quickly and can be applied to solve several important problems in data streams such as finding quantiles, frequent items, etc.
Maintaining variance and k-medians over data stream windows
TLDR
A novel technique is presented for solving two important and related problems in the sliding window model: maintaining variance and maintaining a k-median clustering. A constant-factor approximation algorithm is also presented.
Stream Order and Order Statistics: Quantile Estimation in Random-Order Streams
TLDR
The first fully general lower bounds in the random-order model are proved: finding an element with rank n/2 ± n^δ in the single-pass random-order model with probability at least 9/10 requires Ω(√(n^(1−3δ) / log n)) space.