Combining Partial Redundancy and Checkpointing for HPC

Authors: James Elliott, Kishor Kharbas, David Fiala, Frank Mueller, Kurt B. Ferreira, Christian Engelmann
Venue: 2012 IEEE 32nd International Conference on Distributed Computing Systems

Today's largest High Performance Computing (HPC) systems exceed one petaflops (10^15 floating point operations per second), and exascale systems are projected within seven years. But reliability is becoming one of the major challenges faced by exascale computing. With billion-core parallelism, the mean time to failure is projected to be in the range of minutes or hours instead of days. Failures are becoming the norm rather than the exception during execution of HPC applications. Current fault…