Reducing Floating Point Error in Dot Product Using the Superblock Family of Algorithms

@article{Castaldo2008ReducingFP,
  title={Reducing Floating Point Error in Dot Product Using the Superblock Family of Algorithms},
  author={Anthony M. Castaldo and R. Clint Whaley and Anthony T. Chronopoulos},
  journal={SIAM J. Sci. Comput.},
  year={2008},
  volume={31},
  pages={1156-1174}
}
This paper discusses both the theoretical and statistical errors obtained by various well-known dot product algorithms, from the canonical to the pairwise, and introduces a new and more general framework that we have named superblock, which subsumes them and permits a practitioner to make trade-offs between computational performance, memory usage, and error behavior. We show that algorithms with lower error bounds tend to behave noticeably better in practice. Unlike many such error-reducing…
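
The trade-off the abstract describes is easiest to see in code. Below is a minimal C sketch (not the authors' implementation; the block size NB is only an illustrative parameter): it contrasts canonical left-to-right accumulation, whose worst-case error grows linearly in n, with pairwise recursive halving, whose error grows only logarithmically, and adds a simple blocked variant in the spirit of the superblock idea, in which contiguous blocks are summed canonically for speed and the block partial sums are combined recursively.

    /* Minimal illustrative sketch, not the paper's code.  NB is an
     * arbitrary example block size, not a value from the paper. */
    #include <stddef.h>

    /* Canonical accumulation: error bound grows roughly with n. */
    double dot_canonical(const double *x, const double *y, size_t n)
    {
        double s = 0.0;
        for (size_t i = 0; i < n; i++)
            s += x[i] * y[i];
        return s;
    }

    /* Pairwise (recursive halving): error bound grows roughly with log2(n). */
    double dot_pairwise(const double *x, const double *y, size_t n)
    {
        if (n == 0) return 0.0;
        if (n == 1) return x[0] * y[0];
        size_t h = n / 2;
        return dot_pairwise(x, y, h) + dot_pairwise(x + h, y + h, n - h);
    }

    /* Blocked variant: sum each block of at most NB elements canonically
     * (fast, cache- and register-friendly), then combine the block partial
     * sums recursively, trading between the two error behaviors above. */
    #define NB 64
    double dot_blocked(const double *x, const double *y, size_t n)
    {
        if (n <= NB) return dot_canonical(x, y, n);
        size_t nblk = (n + NB - 1) / NB;   /* number of NB-element blocks */
        size_t half = (nblk / 2) * NB;     /* split on a block boundary   */
        return dot_blocked(x, y, half)
             + dot_blocked(x + half, y + half, n - half);
    }

Varying NB moves this sketch between the two extremes: NB >= n recovers the canonical algorithm, while NB = 1 recovers pairwise summation, which is the sense in which a blocked framework can subsume both.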

Parallelism and error reduction in a high performance environment
This dissertation details contributions made by the author to the field of computer science while working to improve the performance of a state-of-the-art linear algebra library intended for use on…
Improving numerical accuracy for non-negative matrix multiplication on GPUs using recursive algorithms
TLDR
The limitations of hardware for improving the accuracy of non-negative matrix multiply are explored by specifically comparing implementations on the GPU and CPU, and algorithmic solutions to improve accuracy are proposed.
Improving the Accuracy of High Performance BLAS Implementations Using Adaptive Blocked Algorithms
TLDR
The three-level hybrid algorithm presented here not only has up to 10% better performance than the fastest high-performance matrix multiply, but is also more accurate.
The Better Accuracy of Strassen-Winograd Algorithms (FastMMW)
TLDR
It is shown that the maximum absolute error of any FastMMW algorithm can be improved theoretically and empirically by 10%–20% per recursion (the authors reduce the error by half over 4 recursions).
Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures
TLDR
This paper presents the design and implementation of a Hessenberg reduction algorithm that is immune to simultaneous soft errors, is capable of taking advantage of hybrid GPU-CPU platforms, and introduces less than 2% performance overhead compared to the optimized but fault-prone hybrid implementation.
Efficient generation of sequences of dense linear algebra through auto-tuning
TLDR
A matrix representation and type system is presented that describes basic linear algebra operations, the loops required to implement those operations, and the legality of key optimizations; the generated code can match or exceed the performance of vendor-tuned BLAS libraries, general-purpose optimizing compilers, and hand-written code.
Accumulation Bit-Width Scaling For Ultra-Low Precision Training Of Deep Networks
TLDR
A statistical approach is presented to analyze the impact of reduced accumulation precision on deep learning training and enables precise tailoring of computation hardware to the application, yielding area- and power-optimal systems.
Algorithm-Based Fault Tolerance for Two-Sided Dense Matrix Factorizations
TLDR
Algorithm-based fault tolerant (ABFT) algorithms for the parallel Hessenberg reduction and the parallel tridiagonal reduction use a combination of ABFT and diskless checkpointing to protect frequently modified data.
Parallel reduction to Hessenberg form with Algorithm-Based Fault Tolerance
TLDR
This paper presents a generic algorithm-based approach capable of making two-sided factorizations resilient and establishes the theoretical proof of the correctness and the numerical stability of the approach in the context of a Hessenberg Reduction.
Inner product computation for sparse iterative solvers on distributed supercomputer
TLDR
Both the analysis and the experiments indicate that inner product computation is very likely to be the most challenging kernel for inner-product-based iterative solvers to achieve exascale.
...

References

SHOWING 1-10 OF 36 REFERENCES
Accurate Sum and Dot Product
Algorithms for summation and dot product of floating-point numbers are presented which are fast in terms of measured computing time. We show that the computed results are as accurate as if computed…
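
As a rough illustration of the style of technique this reference describes (a hedged sketch, not the paper's own routines): Knuth's TwoSum transformation recovers the exact rounding error of each addition, and accumulating those errors gives a compensated sum.

    /* Illustrative sketch only; not the algorithms from this reference.
     * two_sum computes s = fl(a + b) together with the exact rounding
     * error e, so that a + b == s + e holds exactly.  Note: this relies
     * on strict IEEE semantics, so it must not be compiled with
     * value-changing optimizations such as -ffast-math. */
    #include <stddef.h>

    static void two_sum(double a, double b, double *s, double *e)
    {
        *s = a + b;
        double z = *s - a;
        *e = (a - (*s - z)) + (b - z);
    }

    /* Compensated summation: carry the running sum and the accumulated
     * rounding errors separately, then apply the correction at the end. */
    double sum_compensated(const double *x, size_t n)
    {
        double s = 0.0, c = 0.0;
        for (size_t i = 0; i < n; i++) {
            double e;
            two_sum(s, x[i], &s, &e);
            c += e;
        }
        return s + c;
    }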
A Distillation Algorithm for Floating-Point Summation
TLDR
This paper describes an efficient "distillation" style algorithm which produces a precise sum by exploiting the natural accuracy of compensated cancellation, applicable to all sets of data but particularly appropriate for ill-conditioned data.
Best “ordering” for floating-point addition
TLDR
It is shown that phrasing the question in terms of finding a best ordering is overly restrictive since the most natural and accurate procedures for performing the addition utilize the storage of intermediate sums as well as performing orderings.
Analysis of some known methods of improving the accuracy of floating-point sums
TLDR
The computer-oriented parity arithmetic is not commonly known, but it has some desirable properties, as this paper will demonstrate.
Accurate and Efficient Floating Point Summation
TLDR
Several simple algorithms for accurately computing the sum of n floating point numbers using a wider accumulator are presented, and ways in which the cost of sorting can be reduced or eliminated while retaining accuracy are investigated.
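
A hedged sketch of the general idea this entry describes (a wider accumulator, optionally preceded by sorting), not the algorithms of the reference itself: on platforms where long double carries more significand bits than double, accumulating in it and rounding once at the end absorbs much of the rounding error of the running sum.

    /* Illustrative sketch only; not the algorithms from this reference.
     * Assumes long double is wider than double (true of the x87 80-bit
     * format; not guaranteed on every platform). */
    #include <stddef.h>
    #include <stdlib.h>
    #include <math.h>

    /* Order by decreasing magnitude; sorting can tighten the error bound. */
    static int cmp_desc_mag(const void *a, const void *b)
    {
        double x = fabs(*(const double *)a), y = fabs(*(const double *)b);
        return (x < y) - (x > y);
    }

    double sum_wide(double *x, size_t n, int sort_first)
    {
        if (sort_first)
            qsort(x, n, sizeof *x, cmp_desc_mag);
        long double acc = 0.0L;            /* wider accumulator           */
        for (size_t i = 0; i < n; i++)
            acc += x[i];
        return (double)acc;                /* one rounding back to double */
    }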
A comparison of floating point summation methods
TLDR
This note compares the schemes of Linz and Kahan with straight recursive summation and with pairwise summation of the numbers as means of reducing accumulated roundoff error.
Exploiting fast matrix multiplication within the level 3 BLAS
TLDR
Algorithms for the BLAS3 operations that are asymptotically faster than the conventional ones are described; they are based on Strassen's method for fast matrix multiplication, which is now recognized to be a practically useful technique once matrix dimensions exceed about 100.
Fast and Accurate Floating Point Summation with Application to Computational Geometry
TLDR
The results show that in the absence of massive cancellation (the most common case) the cost of guaranteed accuracy is about 30–40% more than the straightforward summation, and the accurate summation algorithm improves the existing algorithm by a factor of two on a nearly coplanar set of points.
Solving Triangular Systems More Accurately and Efficiently
TLDR
An algorithm that solves triangular linear systems accurately and efficiently is presented; its implementation should run faster than the corresponding XBLAS routine while achieving the same output accuracy.
A comparison of methods for accurate summation
TLDR
It is found that the method of "Cascading Accumulators" is the fastest of several methods, and the Double Compensation method (in both single and double precision versions) is also perfectly accurate in all the tests performed.
...