A Survey of Approximate Quantile Computation on Large-Scale Data

@article{Chen2020ASO,
  title={A Survey of Approximate Quantile Computation on Large-Scale Data},
  author={Zhiwei Chen and Aoqian Zhang},
  journal={IEEE Access},
  year={2020},
  volume={8},
  pages={34585--34597}
}
As data volumes grow, data profiling helps to extract metadata from large-scale data. However, one kind of metadata, order statistics, is difficult to compute because order statistics are neither mergeable nor incremental. As a result, limits on time and memory make their computation on large-scale data impractical. In this paper, we focus on one order statistic, quantiles, and present a comprehensive analysis of studies on approximate quantile computation. Both deterministic algorithms and… 

Efficient and Error-bounded Spatiotemporal Quantile Monitoring in Edge Computing Environments
TLDR
This paper designs a processing framework that virtualizes edge-resident data sketches for quantile computing and devises a relaxation algorithm to converge to optimal latencies for those subqueries whose result errors are still bounded.
KLL±: Approximate Quantile Sketches over Dynamic Datasets
TLDR
KLL± is proposed, the first quantile approximation algorithm to operate in the bounded deletion model, accounting for both inserts and deletes in a given data stream to support arbitrary updates with small space overhead.
Optimal Round and Sample-Size Complexity for Partitioning in Parallel Sorting
TLDR
This work derives lower and upper bounds on the number of sampling/histogramming rounds required to compute a balanced partitioning; it proposes a hard randomized input distribution and applies classical results from the distribution theory of runs to derive the lower bound.
Histogram Specification by Assignment of Optimal Unique Values
TLDR
Two novel algorithms for histogram specification and quantile transformation of data without local information are proposed that can be easily incorporated in applications spanning many disciplines, especially in applied data science.
Scaling Equi-Joins
TLDR
This paper proposes Adaptive-Multistage-Join (AM-Join), a novel algorithm that scales well when the joined tables share hot keys, and Broadcast-Join, the fastest-known when joining keys are hot in only one table, for scalable and fast equi-joins in distributed shared-nothing architectures.
Spiking Neural Networks Through the Lens of Streaming Algorithms
TLDR
A generic reduction is given, showing that any space-efficient spiking neural network can be simulated by a space-efficient streaming algorithm, and establishing a close connection between these two models.
Assessment of Variability in Irregularly Sampled Time Series: Applications to Mental Healthcare
TLDR
Different variability metrics applied to irregularly (nonuniformly) sampled time series, which have important clinical applications, particularly in mental healthcare, are compared to identify the most robust and interpretable variability measures out of a set of 21 candidates.
AcME - Accelerated Model-agnostic Explanations: Fast Whitening of the Machine-Learning Black Box
TLDR
Accelerated Model-agnostic Explanations (AcME) is an interpretability approach that quickly provides feature importance scores at both the global and the local level and can be applied a posteriori to any regression or classification model.
hermiter: R package for Sequential Nonparametric Estimation
This article introduces the R package hermiter which facilitates estimation of univariate and bivariate probability density functions and cumulative distribution functions along with full quantile
Statistical anonymity: Quantifying reidentification risks without reidentifying users
TLDR
This paper explores ideas — objectives, metrics, protocols, and extensions — for reducing the trust that must be placed in the curator, while still maintaining a statistical notion of k-anonymity, and describes a class of protocols aimed at achieving these goals.

References

Space- and time-efficient deterministic algorithms for biased quantiles over data streams
TLDR
This work presents the first deterministic algorithms for answering biased quantiles queries accurately with small—sublinear in the input size—space and time bounds in one pass, and shows it uses less space than existing methods in many practical settings, and is fast to maintain.
Quantiles over data streams: experimental comparisons, new analyses, and further improvements
TLDR
This paper provides a taxonomy of different methods and proposes new variants that have not been studied before, yet which outperform existing methods and describe efficient implementations of these methods.
Computing Extremely Accurate Quantiles Using t-Digests
TLDR
This new algorithm is robust with respect to skewed distributions or ordered datasets and allows separately computed summaries to be combined with no loss in accuracy.
A Fast Algorithm for Approximate Quantiles in High Speed Data Streams
  • Qi Zhang, Wei Wang
  • Computer Science
    19th International Conference on Scientific and Statistical Database Management (SSDBM 2007)
  • 2007
TLDR
A fast algorithm for computing approximate quantiles in high-speed data streams with deterministic error bounds. For a stream of size N, where N is unknown in advance, the stream is partitioned into sub-streams of exponentially increasing size as they arrive.
Effective computation of biased quantiles over data streams
TLDR
This paper formalizes them as the "high-biased" and the "targeted" quantiles problems, respectively, and presents algorithms with provable guarantees, that perform significantly better than previously known solutions for these problems.
Accurate quantile estimation for skewed data streams
  • Zheng Lin, Jun Liu, N. Lin
  • Computer Science
    2017 IEEE 28th Annual International Symposium on Personal, Indoor, and Mobile Radio Communications (PIMRC)
  • 2017
TLDR
The comprehensive experimental evaluation results demonstrate that the estimated quantiles of the proposed algorithm are more accurate than those of existing methods in the literature on both synthetic and real-world datasets, especially on important extreme quantiles.
Approximate medians and other quantiles in one pass and with limited memory
TLDR
New algorithms for computing approximate quantiles of large datasets in a single pass are presented, and the main memory requirements are smaller than those reported by an order of magnitude.
A One-Pass Space-Efficient Algorithm for Finding Quantiles
TLDR
An algorithm for finding the quantile values of a large unordered dataset with unknown distribution that requires only one pass over the data; the true quantile is guaranteed to lie within the lower and upper bounds produced by the algorithm.
Accurate Quantile Estimation for Skewed Data Streams Using Nonlinear Interpolation
TLDR
The comprehensive experimental evaluation results demonstrate that the estimated quantiles of the proposed algorithm are more accurate than the existing methods in the literature on both synthetic and real-world datasets, especially on important extreme quantiles.
A One-Pass Algorithm for Accurately Estimating Quantiles for Disk-Resident Data
TLDR
The experimental results show that the algorithm is indeed robust and does not depend on the distribution of the data sets, and extra time and memory for computing additional quantiles (beyond the first one) are constant per quantile.