Correcting soft errors online in LU factorization

Teresa Davies and Zizhong Chen
In high-performance systems, the probability of failure grows with the number of processors. Calculation errors may occur that cannot be detected by external means. To address this problem, we create a checksum-based approach that detects and recovers from calculation errors. We apply this approach to the LU factorization algorithm used by High Performance Linpack. Our approach has low overhead; in contrast to an existing approach that requires repeated calculation, it repeats only a fraction of…
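The abstract does not spell out the checksum scheme, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a classic two-checksum encoding (a plain row sum and an index-weighted row sum), unblocked LU without pivoting, at most one corrupted data entry per row between checks, and an absolute tolerance `tol`. All names (`augment`, `check_and_correct`, `lu_with_checksums`, the `fault` tuple) are hypothetical. Because elimination only adds multiples of one row to another, both checksums stay valid throughout the factorization, which lets an error be detected, located, and fixed without repeating the step.

```python
def augment(a):
    """Append two checksum columns per row: plain sum and index-weighted sum."""
    return [list(row) + [sum(row), sum((j + 1) * v for j, v in enumerate(row))]
            for row in a]


def check_and_correct(m, n, tol=1e-8):
    """Verify each row against its checksums and repair one bad data entry.

    Assumes at most one corrupted entry per row, located in the data
    columns (not in the checksum columns themselves).
    Returns the list of (row, col) positions that were corrected.
    """
    fixed = []
    for i, row in enumerate(m):
        d1 = sum(row[:n]) - row[n]  # discrepancy in the plain sum
        if abs(d1) <= tol:
            continue
        # Discrepancy in the weighted sum is (col + 1) * d1, which locates the column.
        d2 = sum((j + 1) * row[j] for j in range(n)) - row[n + 1]
        col = round(d2 / d1) - 1
        row[col] -= d1  # undo the error
        fixed.append((i, col))
    return fixed


def lu_with_checksums(a, fault=None):
    """LU factorization (no pivoting) with an online checksum check per step.

    fault is an optional hypothetical tuple (step, row, col, delta) that
    injects a soft error after the given elimination step, for demonstration.
    Returns (lower, upper, corrections).
    """
    n = len(a)
    m = augment(a)
    lower = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    corrections = []
    for k in range(n):
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            lower[i][k] = f
            # Update data and checksum columns alike; the invariant is preserved.
            for j in range(k, n + 2):
                m[i][j] -= f * m[k][j]
            m[i][k] = 0.0
        if fault is not None and fault[0] == k:
            _, fi, fj, delta = fault
            m[fi][fj] += delta  # simulated soft error
        corrections += check_and_correct(m, n)
    upper = [row[:n] for row in m]
    return lower, upper, corrections


A = [[4, 3, 2], [8, 7, 9], [4, 6, 5]]
lower, upper, fixes = lu_with_checksums(A, fault=(0, 2, 1, 0.5))
print(fixes)  # positions that were detected and repaired
```

Checking after every elimination step keeps the error from propagating into later updates; a production code would work on blocks, handle pivoting, and choose the tolerance relative to the matrix norm.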

Publications citing this paper.

Resilient Iterative Linear Solvers Running Through Errors

Asynchronous and Exact Forward Recovery for Detected Errors in Iterative Solvers

IEEE Transactions on Parallel and Distributed Systems • 2018

Energy Analysis and Optimization for Resilient Scalable Linear Systems

2018 IEEE International Conference on Cluster Computing (CLUSTER) • 2018

Algorithm-Directed Crash Consistence in Non-volatile Memory for HPC

2017 IEEE International Conference on Cluster Computing (CLUSTER) • 2017
