Building a Fault Tolerant MPI Application: A Ring Communication Example

@article{Hursey2011BuildingAF,
  title={Building a Fault Tolerant MPI Application: A Ring Communication Example},
  author={Joshua Hursey and Richard L. Graham},
  journal={2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and Phd Forum},
  year={2011},
  pages={1549--1556}
}
Process failure is projected to become a normal event for many long-running and scalable High Performance Computing (HPC) applications. As such, many application developers are investigating Algorithm Based Fault Tolerance (ABFT) techniques to improve the efficiency of application recovery beyond what existing checkpoint/restart techniques alone can provide. Unfortunately for these application developers, the libraries that their applications depend upon, like Message Passing Interface (MPI), do…
This paper has 18 citations.
