Chapter 1 Fault tolerance techniques for high-performance computing

Jack Dongarra, Thomas Hérault, and Yves Robert
This chapter provides an introduction to resilience methods. The emphasis is on checkpointing, the de facto standard technique for resilience in high-performance computing. We present the two main protocols, namely coordinated checkpointing and hierarchical checkpointing. Then we introduce performance models and use them to assess the performance of these protocols. We cover the Young/Daly formula for the optimal period and much more! Next we explain how the efficiency of checkpointing can be…
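As a rough illustration of the Young/Daly formula mentioned above, the sketch below uses the widely cited first-order approximation, in which the optimal checkpointing period is about sqrt(2 · C · μ), where C is the time to take one checkpoint and μ is the platform mean time between failures (MTBF). The function names, the division of the per-node MTBF by the node count, and all numeric values are assumptions chosen for illustration; they are not taken from the chapter, which may define the period and its refinements differently.

```python
import math


def platform_mtbf_seconds(node_mtbf_seconds: float, num_nodes: int) -> float:
    """Platform MTBF, assuming independent nodes with identical MTBF.

    Under that assumption, a platform of N nodes fails N times as often
    as a single node, so its MTBF is the node MTBF divided by N.
    """
    if node_mtbf_seconds <= 0 or num_nodes <= 0:
        raise ValueError("node_mtbf_seconds and num_nodes must be positive")
    return node_mtbf_seconds / num_nodes


def young_daly_period(checkpoint_cost: float, mtbf: float) -> float:
    """First-order approximation of the optimal checkpointing period.

    Returns sqrt(2 * C * mu), in the same time unit as the inputs.
    The approximation is meaningful when C is small compared to mu.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


if __name__ == "__main__":
    # Hypothetical platform: 100,000 nodes, each with a 10-year MTBF,
    # and a 10-minute checkpoint. These numbers are illustrative only.
    ten_years = 10 * 365 * 24 * 3600
    mu = platform_mtbf_seconds(ten_years, 100_000)
    period = young_daly_period(checkpoint_cost=600.0, mtbf=mu)
    print(f"platform MTBF: {mu:.1f} s, checkpoint period: {period:.1f} s")
```

The square-root form reflects the trade-off between the two sources of waste: checkpointing too often spends time writing checkpoints, while checkpointing too rarely loses more work at each failure.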