User Level Failure Mitigation in MPI

Wesley Bland
Published in Euro-Par Workshops
In a constant effort to deliver steady performance improvements, the size of High Performance Computing (HPC) systems, as observed by the Top 500 ranking, has grown tremendously over the last decade. This trend, along with the resultant decrease of the Mean Time Between Failures (MTBF), is unlikely to stop; consequently, many computing nodes will inevitably fail during application execution [5]. It is alarming that most popular fault-tolerant approaches see their efficiency plummet at Exascale [3, 4…
This paper has 23 citations.


Publications citing this paper.

Design and Evaluation of FA-MPI, a Transactional Resilience Scheme for Non-blocking MPI

2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks • 2014

A Malleable and Fault-Tolerant Task Pool Framework for X10

2017 IEEE International Conference on Cluster Computing (CLUSTER) • 2017


Publications referenced by this paper.

A proposal for User-Level Failure Mitigation in the MPI-3 standard

W. Bland, G. Bosilca, A. Bouteiller, T. Herault, J. Dongarra
Tech. rep., Department of Electrical Engineering and Computer Science, University of Tennessee • 2012

Unified model for assessing checkpointing protocols at extreme-scale

Concurrency and Computation: Practice and Experience • 2014

Toward Exascale Resilience

IJHPCA • 2009
