# Reducing Floating Point Error in Dot Product Using the Superblock Family of Algorithms

@article{Castaldo2008ReducingFP,
  title   = {Reducing Floating Point Error in Dot Product Using the Superblock Family of Algorithms},
  author  = {Anthony M. Castaldo and R. Clint Whaley and Anthony T. Chronopoulos},
  journal = {SIAM J. Sci. Comput.},
  year    = {2008},
  volume  = {31},
  pages   = {1156-1174}
}

This paper discusses both the theoretical and statistical errors incurred by various well-known dot product algorithms, from the canonical to the pairwise, and introduces a new and more general framework that we have named superblock, which subsumes them and permits a practitioner to make trade-offs between computational performance, memory usage, and error behavior. We show that algorithms with lower error bounds tend to behave noticeably better in practice. Unlike many such error-reducing…
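The superblock idea admits a simple sketch: accumulate the dot product in fixed-size blocks, then combine the block partial sums pairwise. The following Python illustration is a minimal sketch under assumed names and a single level of blocking, not the paper's actual implementation (which generalizes over multiple blocking levels):

```python
def dot_blocked(x, y, nb=256):
    """Blocked dot product: canonical summation within each block of nb
    elements, then pairwise combination of the block partial sums."""
    n = len(x)
    # Canonical (left-to-right) partial dot product within each block.
    partials = [sum(xi * yi for xi, yi in zip(x[i:i + nb], y[i:i + nb]))
                for i in range(0, n, nb)]
    # Combine the block partials pairwise, so the combination error
    # grows logarithmically in the number of blocks.
    while len(partials) > 1:
        partials = [partials[i] + (partials[i + 1] if i + 1 < len(partials) else 0.0)
                    for i in range(0, len(partials), 2)]
    return partials[0] if partials else 0.0
```

Setting `nb = n` recovers the canonical algorithm, while shrinking `nb` toward 1 approaches fully pairwise behavior — the performance/error trade-off the abstract describes.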


## 20 Citations

Parallelism and error reduction in a high performance environment

- Computer Science
- 2010

This dissertation details contributions made by the author to the field of computer science while working to improve the performance of a state-of-the-art linear algebra library intended for use on…

Improving numerical accuracy for non-negative matrix multiplication on GPUs using recursive algorithms

- Computer Science
- ICS '13
- 2013

The limitations of hardware for improving the accuracy of non-negative matrix multiplication are explored by comparing GPU and CPU implementations, and algorithmic solutions to improve accuracy are proposed.

Improving the Accuracy of High Performance BLAS Implementations Using Adaptive Blocked Algorithms

- Computer Science
- 2011 23rd International Symposium on Computer Architecture and High Performance Computing
- 2011

The three-level hybrid algorithm presented here not only has up to 10% better performance than the fastest high-performance matrix multiply, but is also more accurate.

The Better Accuracy of Strassen-Winograd Algorithms (FastMMW)

- Computer Science
- 2014

It is shown that the maximum absolute error of any FastMMW algorithm can be improved theoretically and empirically by 10% - 20% per recursion (the authors reduce the error by half for 4 recursions).

Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures

- Computer Science
- 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
- 2016

This paper presents the design and implementation of a Hessenberg reduction algorithm immune to simultaneous soft errors, capable of taking advantage of hybrid GPU-CPU platforms, and introduces less than 2% performance overhead compared to the optimized but fault-prone hybrid implementation.

Efficient generation of sequences of dense linear algebra through auto-tuning

- Computer Science
- 2012

A matrix representation and type system is presented that describes basic linear algebra operations, the loops required to implement them, and the legality of key optimizations; the resulting code can match or exceed the performance of vendor-tuned BLAS libraries, general-purpose optimizing compilers, and hand-written code.

Accumulation Bit-Width Scaling For Ultra-Low Precision Training Of Deep Networks

- Computer Science
- ICLR
- 2019

A statistical approach is presented to analyze the impact of reduced accumulation precision on deep learning training and enables precise tailoring of computation hardware to the application, yielding area- and power-optimal systems.

Algorithm-Based Fault Tolerance for Two-Sided Dense Matrix Factorizations

- Computer Science
- 2015

Algorithm-based fault-tolerant (ABFT) algorithms for the parallel Hessenberg reduction and the parallel tridiagonal reduction are presented; they use a combination of ABFT and diskless checkpointing to protect frequently modified data.

Parallel reduction to Hessenberg form with Algorithm-Based Fault Tolerance

- Computer Science
- 2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
- 2013

This paper presents a generic algorithm-based approach capable of making two-sided factorizations resilient and establishes the theoretical proof of the correctness and the numerical stability of the approach in the context of a Hessenberg Reduction.

Inner product computation for sparse iterative solvers on distributed supercomputer

- Computer Science
- 2012

Both the analysis and the experiments indicate that inner product computation is very likely to be the most challenging kernel for inner-product-based iterative solvers to achieve exascale performance.

## References

SHOWING 1-10 OF 36 REFERENCES

Accurate Sum and Dot Product

- Computer Science
- SIAM J. Sci. Comput.
- 2005

Algorithms for summation and dot product of floating-point numbers are presented which are fast in terms of measured computing time. We show that the computed results are as accurate as if computed…
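The key building block in this line of work is the error-free transformation of a floating-point sum. The sketch below uses the standard Knuth/Møller TwoSum transformation; the compensated loop is a simplified relative of the paper's dot product algorithm (which additionally applies an error-free transformation to each product), shown here only as an illustration:

```python
def two_sum(a, b):
    """Error-free transformation: returns (s, e) with s = fl(a + b) and
    a + b = s + e exactly in IEEE arithmetic (Knuth/Moller TwoSum)."""
    s = a + b
    t = s - a
    e = (a - (s - t)) + (b - t)
    return s, e

def compensated_dot(x, y):
    """Compensated dot product: accumulate products with TwoSum and
    carry the rounding errors in a separate correction term."""
    s = 0.0
    c = 0.0  # accumulated rounding error of the additions
    for xi, yi in zip(x, y):
        s, e = two_sum(s, xi * yi)
        c += e
    return s + c
```

The pair `(s, e)` captures the addition's rounding error exactly, which is what lets the computed result behave as if accumulated in higher precision.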

A Distillation Algorithm for Floating-Point Summation

- Computer Science
- SIAM J. Sci. Comput.
- 1999

This paper describes an efficient "distillation" style algorithm which produces a precise sum by exploiting the natural accuracy of compensated cancellation, applicable to all sets of data but particularly appropriate for ill-conditioned data.

Best “ordering” for floating-point addition

- Business
- TOMS
- 1988

It is shown that phrasing the question in terms of finding a best ordering is overly restrictive since the most natural and accurate procedures for performing the addition utilize the storage of intermediate sums as well as performing orderings.

Analysis of some known methods of improving the accuracy of floating-point sums

- Computer Science
- 1974

The computer-oriented parity arithmetic is not commonly known, but it has some desirable properties, as this paper will demonstrate.

Accurate and Efficient Floating Point Summation

- Computer Science
- SIAM J. Sci. Comput.
- 2004

Several simple algorithms for accurately computing the sum of n floating point numbers using a wider accumulator are presented and how the cost of sorting can be reduced or eliminated while retaining accuracy is investigated.
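The wider-accumulator idea can be demonstrated by simulating single-precision accumulation in Python (whose floats are doubles) via the `struct` module — a pure illustration with assumed helper names, not the paper's algorithms:

```python
import struct

def to_f32(x):
    """Round a Python float (IEEE double) to the nearest IEEE single."""
    return struct.unpack('f', struct.pack('f', x))[0]

def sum_f32(xs):
    """Accumulate in simulated single precision: round after every add."""
    s = 0.0
    for x in xs:
        s = to_f32(s + x)
    return s

def sum_wide(xs):
    """Accumulate the same inputs in the wider double format,
    rounding to single precision only once at the end."""
    return to_f32(sum(xs))

# 10000 copies of the single-precision value nearest 0.1:
data = [to_f32(0.1)] * 10000
```

Here `sum_f32(data)` drifts away from 1000 because a rounding error is committed on every addition, while `sum_wide(data)` stays within a few single-precision ulps — the benefit of the wider accumulator.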

A comparison of floating point summation methods

- Computer Science
- CACM
- 1972

This note compares the schemes of Linz and Kahan, and the pairwise summing of numbers, with straight recursive summation as means of reducing accumulated roundoff error.
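Pairwise summation, one of the schemes compared here, splits the input in half recursively so that rounding errors grow roughly as O(log n) rather than O(n). A minimal sketch:

```python
def pairwise_sum(xs):
    """Recursive pairwise summation of a list of floats."""
    n = len(xs)
    if n == 0:
        return 0.0
    if n == 1:
        return xs[0]
    m = n // 2
    # Sum each half independently, then combine: the recursion tree has
    # depth O(log n), which bounds the worst-case error growth.
    return pairwise_sum(xs[:m]) + pairwise_sum(xs[m:])
```

Production implementations switch to a simple loop below some block size to avoid recursion overhead, which is the same blocking trade-off the superblock paper generalizes.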

Exploiting fast matrix multiplication within the level 3 BLAS

- Computer Science
- TOMS
- 1990

Algorithms for the BLAS3 operations that are asymptotically faster than the conventional ones are described, based on Strassen's method for fast matrix multiplication, which is now recognized to be a practically useful technique once matrix dimensions exceed about 100.

Fast and Accurate Floating Point Summation with Application to Computational Geometry

- Computer Science
- Numerical Algorithms
- 2004

The results show that in the absence of massive cancellation (the most common case) the cost of guaranteed accuracy is about 30–40% more than that of straightforward summation, and that the accurate summation algorithm improves on the existing algorithm by a factor of two on a nearly coplanar set of points.

Solving Triangular Systems More Accurately and Efficiently

- Computer Science
- 2005

An algorithm is presented that solves triangular linear systems accurately and efficiently; its implementation should run faster than the corresponding XBLAS routine while achieving the same output accuracy.

A comparison of methods for accurate summation

- Computer Science
- SIGS
- 2004

It is found that the method of "Cascading Accumulators" is the fastest of several methods, and that the Double Compensation method (in both single- and double-precision versions) is also perfectly accurate in all the tests performed.