Numerically stable parallel computation of (co-)variance

@inproceedings{Schubert2018NumericallySP,
  title={Numerically stable parallel computation of (co-)variance},
  author={Erich Schubert and Michael Gertz},
  booktitle={Proceedings of the 30th International Conference on Scientific and Statistical Database Management},
  year={2018}
}
  • Erich Schubert, Michael Gertz
  • Published 9 July 2018
  • Computer Science
  • Proceedings of the 30th International Conference on Scientific and Statistical Database Management
With the advent of big data, we see an increasing interest in computing correlations in huge data sets with both many instances and many variables. Essential descriptive statistics such as the variance, standard deviation, covariance, and correlation can suffer from a numerical instability known as "catastrophic cancellation" that can lead to problems when naively computing these statistics with a popular textbook equation. While this instability has been discussed in the literature already 50… 
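To make the instability concrete, here is a minimal Python sketch (ours, not code from the paper) contrasting the textbook formula E[X²] − E[X]² with a Welford-style one-pass update:

```python
# Illustrative sketch, not code from the paper. With a large mean and a
# small spread, the textbook formula subtracts two nearly equal large
# numbers, and cancellation destroys the result; the Welford-style
# update accumulates deviations from a running mean and stays accurate.

def naive_variance(xs):
    """Textbook formula E[X^2] - E[X]^2 (numerically unstable)."""
    n = len(xs)
    mean = sum(xs) / n
    mean_sq = sum(x * x for x in xs) / n
    return mean_sq - mean * mean

def welford_variance(xs):
    """Numerically stable one-pass (Welford-style) algorithm."""
    n, mean, m2 = 0, 0.0, 0.0
    for x in xs:
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)  # second factor uses the updated mean
    return m2 / n

data = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]  # true variance: 22.5
print(naive_variance(data))    # badly wrong, can even come out negative
print(welford_variance(data))  # close to 22.5
```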

Citations

BETULA: Numerically Stable CF-Trees for BIRCH Clustering
TLDR
This work introduces a replacement cluster feature that does not suffer from this numeric problem, is not much more expensive to maintain, and makes many computations simpler and hence more efficient.
kEDM: A Performance-portable Implementation of Empirical Dynamic Modeling using Kokkos
TLDR
This paper designs and develops a performance-portable implementation of EDM based on the Kokkos performance portability framework (kEDM), which runs on both CPUs and GPUs from a single codebase.
Accurate and consistent calculation of the mean and variance in Monte-Carlo simulations
In parallelized Monte-Carlo simulations, the order of summation is not always the same. When the mean is calculated in running fashion, this may create an artificial randomness in results which ought to be deterministic.
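The effect this entry describes is easy to reproduce; a toy sketch (ours, not that paper's code) showing that a running mean depends on summation order because floating-point addition is not associative:

```python
import random

def running_mean(xs):
    """Streaming mean; the result depends slightly on input order."""
    mean = 0.0
    for i, x in enumerate(xs, start=1):
        mean += (x - mean) / i
    return mean

random.seed(0)
# Values of wildly different magnitudes make the effect easy to see.
data = [random.random() * 10.0 ** random.randint(0, 12) for _ in range(10000)]
shuffled = list(data)
random.shuffle(shuffled)
# Same multiset of values, different accumulation order:
print(running_mean(data) == running_mean(shuffled))  # typically False
```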
Improving the PAM, CLARA, and CLARANS Algorithms
TLDR
Modifications to the PAM algorithm are proposed where, at the cost of storing O(k) additional values, an O(k)-fold speedup is achieved in the second (“SWAP”) phase while still finding the same results as the original PAM algorithm.
ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg"
TLDR
The motivation for this release, the plans for the future, and a brief overview of the new functionality in this version of ELKI are outlined, including an appendix presenting an overview of the overall implemented functionality.
Federated Reconnaissance: Efficient, Distributed, Class-Incremental Learning
TLDR
This work proposes an evaluation framework and methodological baseline for a system in which each client is expected to learn a growing set of classes and communicate knowledge of those classes efficiently with other clients, such that, after knowledge merging, the clients should be able to accurately discriminate between classes in the superset of classes observed by the set of clients.
Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms
TLDR
Modifications to the PAM algorithm are proposed to achieve an O(k)-fold speedup in the second SWAP phase of the algorithm, while still finding the same results as the original PAM algorithm.
Operon C++: an efficient genetic programming framework for symbolic regression
TLDR
Operon is introduced, a C++ GP framework focused on performance, modularity and usability, featuring an efficient linear tree encoding and a scalable concurrency model where each logical thread is responsible for generating a new individual.
A Triangle Inequality for Cosine Similarity
TLDR
This paper derives a triangle inequality for cosine similarity that is suitable for efficient similarity search with many standard search structures (such as the VP-tree, cover tree, and M-tree), shows that this bound is tight, and discusses fast approximations for it.

References

Showing 1-10 of 33 references
Formulas for robust, one-pass parallel computation of covariances and arbitrary-order statistical moments
TLDR
A formula for the pairwise update of arbitrary-order centered statistical moments is presented, which is of particular interest for computing such moments in parallel over large-scale, distributed data sets.
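For the second-order case that this reference generalizes, the pairwise update can be sketched in a few lines of Python (our own notation; c denotes the co-moment Σ(x − x̄)(y − ȳ) of a partition):

```python
def merge_cov(n_a, mx_a, my_a, c_a, n_b, mx_b, my_b, c_b):
    """Merge (count, mean_x, mean_y, co-moment) of two data partitions."""
    n = n_a + n_b
    dx = mx_b - mx_a
    dy = my_b - my_a
    mx = mx_a + dx * n_b / n                 # combined means
    my = my_a + dy * n_b / n
    c = c_a + c_b + dx * dy * n_a * n_b / n  # cross-partition correction
    return n, mx, my, c                      # covariance of the union: c / n

a = (2, 2.0, 12.0, 4.0)   # stats of partition {(1,10), (3,14)}
b = (2, 6.0, 22.0, 8.0)   # stats of partition {(5,18), (7,26)}
print(merge_cov(*a, *b))  # (4, 4.0, 17.0, 52.0) -> covariance 13.0
```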
A Closer Look at Variance Implementations In Modern Database Systems
TLDR
This paper studies variance implementations in various real-world systems and finds that major database systems such as PostgreSQL, and most likely System X, use a representation that is efficient, but suffers from floating-point precision loss resulting from catastrophic cancellation.
Algorithms for Computing the Sample Variance: Analysis and Recommendations
TLDR
A survey of possible algorithms and their round-off error bounds is presented, including some new analysis for computations with shifted data, and experimental results confirm these bounds and illustrate the dangers of some algorithms.
The (black) art of runtime evaluation: Are we comparing algorithms or implementations?
TLDR
This work substantiates its points with extensive experiments, using clustering and outlier detection methods with and without index acceleration, and discusses what one can learn from evaluations, whether experiments are properly designed, and what kind of conclusions one should avoid.
Computing standard deviations: accuracy
TLDR
Four algorithms for the numerical computation of the standard deviation of (unweighted) sampled data are analyzed, and it is concluded that all four will provide accurate answers for many problems, but two of them are substantially more accurate on difficult problems than the other two.
Remark on stably updating mean and standard deviation of data
Although not published as a numbered algorithm, Hanson's article “Stably Updating Mean and Standard Deviation of Data” in the January 1975 issue of Communications [1] describes an algorithm for…
Updating formulae and a pairwise algorithm for computing sample variances
A general formula is presented for computing the sample variance for a sample of size m + n given the means and variances for two subsamples of sizes m and n. This formula is used in the construction of a pairwise algorithm for computing the variance.
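The combining rule this snippet describes is compact enough to state directly; a small Python sketch (our variable names, with s the sum of squared deviations from the subsample mean):

```python
def combine(m, mean_m, s_m, n, mean_n, s_n):
    """Combine size, mean, and sum of squared deviations of two
    subsamples into the statistics of their union."""
    total = m + n
    delta = mean_n - mean_m
    mean = mean_m + delta * n / total
    s = s_m + s_n + delta * delta * m * n / total
    return total, mean, s  # variance of the union: s / total

left = (2, 3.0, 2.0)    # subsample [2, 4]: mean 3, s = 2
right = (2, 7.0, 8.0)   # subsample [5, 9]: mean 7, s = 8
print(combine(*left, *right))  # (4, 5.0, 26.0) -> variance 26/4 = 6.5
```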
Note on a Method for Calculating Corrected Sums of Squares and Products
In many problems the “corrected sum of squares” of a set of values must be calculated, i.e. the sum of squares of the deviations of the values about their mean. The most usual way is to calculate the…
Thirteen ways to look at the correlation coefficient
In 1885, Sir Francis Galton first defined the term “regression” and completed the theory of bivariate correlation. A decade later, Karl Pearson developed the index that we still use to…
Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates
TLDR
This article proposes a technique for adaptive precision arithmetic that can often speed software-level algorithms for exact addition and multiplication of arbitrary precision floating-point values when they are used to perform multiprecision calculations that do not always require exact arithmetic, but must satisfy some error bound.
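A flavour of the error-free transformations underlying such adaptive-precision arithmetic (a standard textbook building block, not code from this article): Knuth's TwoSum recovers the exact rounding error of a floating-point addition.

```python
def two_sum(a, b):
    """Knuth's TwoSum: s is the rounded sum, err the exact rounding
    error, so s + err == a + b exactly (barring overflow)."""
    s = a + b
    bb = s - a
    err = (a - (s - bb)) + (b - bb)
    return s, err

print(two_sum(1e16, 1.0))  # (1e16, 1.0): the 1.0 lost in rounding is recovered
```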