Fault-Tolerance Techniques for High-Performance Computing

@inproceedings{Dongarra2015FaultToleranceTF,
  title={Fault-Tolerance Techniques for High-Performance Computing},
  author={Jack J. Dongarra and Thomas H{\'e}rault and Yves Robert},
  booktitle={Computer Communications and Networks},
  year={2015}
}
This report provides an introduction to resilience methods. The emphasis is on checkpointing, the de facto standard technique for resilience in High Performance Computing. We present the two main protocols, namely coordinated checkpointing and hierarchical checkpointing. Then we introduce performance models and use them to assess the performance of these protocols. We cover the Young/Daly formula for the optimal period and much more! Next we explain how the efficiency of checkpointing can be…
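
As a rough illustration of the Young/Daly formula mentioned in the abstract, the sketch below uses the commonly cited first-order form, T_opt = sqrt(2 * mu * C), where C is the time to take one checkpoint and mu is the platform mean time between failures (MTBF). The function names, the waste approximation C/T + T/(2 * mu), and the assumption that the platform MTBF equals the per-node MTBF divided by the node count are illustrative choices, not taken from the report. Downtime and recovery time are ignored, so the numbers are only meaningful when C is much smaller than mu.

```python
import math


def platform_mtbf(node_mtbf_s: float, num_nodes: int) -> float:
    """Platform MTBF, assuming independent nodes with identical MTBF (illustrative assumption)."""
    if node_mtbf_s <= 0 or num_nodes <= 0:
        raise ValueError("node_mtbf_s and num_nodes must be positive")
    return node_mtbf_s / num_nodes


def young_daly_period(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order optimal checkpointing period: sqrt(2 * mu * C)."""
    if checkpoint_cost_s <= 0 or mtbf_s <= 0:
        raise ValueError("checkpoint_cost_s and mtbf_s must be positive")
    return math.sqrt(2.0 * mtbf_s * checkpoint_cost_s)


def first_order_waste(period_s: float, checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Approximate fraction of time lost: checkpoint overhead plus expected re-execution."""
    if period_s <= 0:
        raise ValueError("period_s must be positive")
    return checkpoint_cost_s / period_s + period_s / (2.0 * mtbf_s)


if __name__ == "__main__":
    # Hypothetical platform: 10,000 nodes, 10-year node MTBF, 10-minute checkpoints.
    mu = platform_mtbf(10 * 365 * 24 * 3600, 10_000)
    period = young_daly_period(600.0, mu)
    print(f"platform MTBF: {mu:.0f} s")
    print(f"checkpoint period: {period:.0f} s")
    print(f"estimated waste: {first_order_waste(period, 600.0, mu):.3f}")
```

In this first-order model, checkpointing more often than T_opt wastes time writing checkpoints, while checkpointing less often wastes time re-executing lost work after a failure. The square-root form is the point where the two costs balance.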
