# Numerically stable parallel computation of (co-)variance

@article{Schubert2018NumericallySP, title={Numerically stable parallel computation of (co-)variance}, author={Erich Schubert and Michael Gertz}, journal={Proceedings of the 30th International Conference on Scientific and Statistical Database Management}, year={2018} }

With the advent of big data, we see an increasing interest in computing correlations in huge data sets with both many instances and many variables. Essential descriptive statistics such as the variance, standard deviation, covariance, and correlation can suffer from a numerical instability known as "catastrophic cancellation" that can lead to problems when naively computing these statistics with a popular textbook equation. While this instability has been discussed in the literature already 50…
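The cancellation effect described in the abstract can be reproduced in a few lines. The sketch below is illustrative only: it assumes the "popular textbook equation" is the sum-of-squares form, (Σx² − (Σx)²/n)/(n − 1), and contrasts it with a plain two-pass computation. The function names and the sample data are hypothetical and are not taken from the paper.

```python
def naive_variance(xs):
    """Sample variance via the sum-of-squares form (assumed textbook equation)."""
    n = len(xs)
    s = sum(xs)
    ss = sum(x * x for x in xs)
    # Two nearly equal, very large numbers are subtracted here,
    # so most of the significant digits cancel.
    return (ss - s * s / n) / (n - 1)


def two_pass_variance(xs):
    """Sample variance computed from deviations about the mean."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


# Small spread (true sample variance 30) on top of a large offset.
data = [1e9 + d for d in (4.0, 7.0, 13.0, 16.0)]

print("naive:   ", naive_variance(data))
print("two-pass:", two_pass_variance(data))
```

With this data the two-pass result is exact, while the sum-of-squares form loses the answer entirely because the squared values exceed the precision of a double.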

## 22 Citations

BETULA: Numerically Stable CF-Trees for BIRCH Clustering

- Computer Science · SISAP
- 2020

This work introduces a replacement cluster feature that does not have this numeric problem, that is not much more expensive to maintain, and which makes many computations simpler and hence more efficient.

kEDM: A Performance-portable Implementation of Empirical Dynamic Modeling using Kokkos

- Computer Science · PEARC
- 2021

This paper designs and develops a performance-portable implementation of EDM based on the Kokkos performance portability framework (kEDM), which runs on both CPUs and GPUs while based on a single codebase.

Accurate and consistent calculation of the mean and variance in Monte-Carlo simulations

- Mathematics
- 2022

In parallelized Monte-Carlo simulations, the order of summation is not always the same. When the mean is calculated in running fashion, this may create an artificial randomness in results which ought…

Improving the PAM, CLARA, and CLARANS Algorithms

- Computer Science
- 2019

Modifications to the PAM algorithm are proposed in which, at the cost of storing O(k) additional values, the algorithm achieves an O(k)-fold speedup in the second (“SWAP”) phase while still finding the same results as the original PAM algorithm.

ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg"

- Computer Science · ArXiv
- 2019

The motivation for this release, the plans for the future, and a brief overview of the new functionality in this version of ELKI are outlined, along with an appendix presenting an overview of the overall implemented functionality.

Fast and eager k-medoids clustering: O(k) runtime improvement of the PAM, CLARA, and CLARANS algorithms

- Computer Science · Inf. Syst.
- 2021

Federated Reconnaissance: Efficient, Distributed, Class-Incremental Learning

- Computer Science · ArXiv
- 2021

This work proposes an evaluation framework and methodological baseline for a system in which each client is expected to learn a growing set of classes and communicate knowledge of those classes efficiently with other clients, such that, after knowledge merging, the clients should be able to accurately discriminate between classes in the superset of classes observed by the set of clients.

Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms

- Computer Science · SISAP
- 2019

Modifications to the PAM algorithm are proposed that achieve an O(k)-fold speedup in the second (SWAP) phase of the algorithm while still finding the same results as the original PAM algorithm.

Operon C++: an efficient genetic programming framework for symbolic regression

- Computer Science · GECCO Companion
- 2020

Operon is introduced, a C++ GP framework focused on performance, modularity and usability, featuring an efficient linear tree encoding and a scalable concurrency model where each logical thread is responsible for generating a new individual.

A Triangle Inequality for Cosine Similarity

- Computer Science · SISAP
- 2021

This paper derives a triangle inequality for cosine similarity that is suitable for efficient similarity search with many standard search structures (such as the VP-tree, Cover-tree, and M-tree), shows that this bound is tight, and discusses fast approximations for it.

## References

Formulas for robust, one-pass parallel computation of covariances and arbitrary-order statistical moments.

- Computer Science
- 2008

A formula for the pairwise update of arbitrary-order centered statistical moments is presented, of particular interest to compute such moments in parallel for large-scale, distributed data sets.

A Closer Look at Variance Implementations In Modern Database Systems

- Computer Science · SGMD
- 2017

This paper studies variance implementations in various real-world systems and finds that major database systems such as PostgreSQL and most likely System X use a representation that is efficient, but suffers from floating point precision loss resulting from catastrophic cancellation.

Algorithms for Computing the Sample Variance: Analysis and Recommendations

- Computer Science
- 1983

A survey of possible algorithms and their round-off error bounds is presented, including some new analysis for computations with shifted data, and experimental results confirm these bounds and illustrate the dangers of some algorithms.

The (black) art of runtime evaluation: Are we comparing algorithms or implementations?

- Computer Science · Knowledge and Information Systems
- 2016

This work substantiates its points with extensive experiments, using clustering and outlier detection methods with and without index acceleration, and discusses what one can learn from evaluations, whether experiments are properly designed, and what kind of conclusions one should avoid.

Computing standard deviations: accuracy

- Computer Science · CACM
- 1979

Four algorithms for the numerical computation of the standard deviation of (unweighted) sampled data are analyzed and it is concluded that all four algorithms will provide accurate answers for many problems, but two of the algorithms are substantially more accurate on difficult problems than are the other two.

Remark on stably updating mean and standard deviation of data

- Mathematics · Commun. ACM
- 1975

Although not published as a numbered algorithm, Hanson's article “Stably Updating Mean and Standard Deviation of Data” in the January, 1975, issue of Communications, [1] describes an algorithm for…

Updating formulae and a pairwise algorithm for computing sample variances

- Computer Science, Mathematics
- 1979

A general formula is presented for computing the sample variance for a sample of size m + n given the means and variances for two subsamples of sizes m and n. This formula is used in the construction…
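As an illustration of combining two subsamples of sizes m and n, the sketch below uses the pairwise update in its commonly cited form: the merged sum of squared deviations is M2_A + M2_B + δ²·m·n/(m + n), where δ is the difference of the two subsample means. This is a minimal sketch under that assumption and is not quoted from the referenced paper; the names `summarize` and `merge` are hypothetical.

```python
def summarize(xs):
    """Return (count, mean, sum of squared deviations) for one subsample."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs)
    return n, mean, m2


def merge(a, b):
    """Combine two summaries without revisiting the underlying data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # The correction term accounts for the distance between the two means.
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


left = summarize([4.0, 7.0])
right = summarize([13.0, 16.0])
n, mean, m2 = merge(left, right)
print("mean:", mean, "sample variance:", m2 / (n - 1))
```

Because each summary holds only three numbers, partitions can be summarized independently (for example, one per worker) and merged in any order afterwards.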

Note on a Method for Calculating Corrected Sums of Squares and Products

- Mathematics
- 1962

In many problems the "corrected sum of squares" of a set of values must be calculated i.e. the sum of squares of the deviations of the values about their mean. The most usual way is to calculate the…

Thirteen ways to look at the correlation coefficient

- Mathematics
- 1988

In 1885, Sir Francis Galton first defined the term “regression” and completed the theory of bivariate correlation. A decade later, Karl Pearson developed the index that we still use to…

Adaptive Precision Floating-Point Arithmetic and Fast Robust Geometric Predicates

- Computer Science
- 1997

This article proposes a technique for adaptive precision arithmetic that can often speed software-level algorithms for exact addition and multiplication of arbitrary precision floating-point values when they are used to perform multiprecision calculations that do not always require exact arithmetic, but must satisfy some error bound.