User-level checkpoint and recovery for LAM/MPI

Youhui Zhang, Dongsheng Wong, Weimin Zheng
Operating Systems Review
As high-performance clusters continue to grow in size and popularity, fault tolerance and reliability are becoming limiting factors on application scalability. We integrated a user-level checkpointing and rollback recovery (CRR) library into LAM/MPI, a high-performance implementation of the Message Passing Interface (MPI), to improve its availability. Compared with the current CRR implementation of LAM/MPI, our work supports file checkpointing and offers higher portability, which can run…
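To make the idea of user-level checkpointing with file checkpointing concrete, here is a minimal single-process sketch in Python. It is not the library described in the abstract and uses no LAM/MPI API: the names `save_checkpoint`, `restore_checkpoint`, and `demo` are hypothetical, and the choice to record an open file's offset and truncate back to it on recovery is an assumption about what "file checkpointing" might involve. A real CRR layer for MPI would also have to coordinate all processes and handle in-flight messages, which this sketch leaves out.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state, data_file):
    """Atomically save application state plus the open file's current offset."""
    data_file.flush()
    os.fsync(data_file.fileno())
    record = {
        "state": state,
        "file_name": data_file.name,
        "offset": data_file.tell(),
    }
    # Write to a temporary file first so a crash never leaves a partial checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(os.path.abspath(path)))
    with os.fdopen(fd, "wb") as tmp:
        pickle.dump(record, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())
    os.replace(tmp_path, path)  # atomic rename over any older checkpoint


def restore_checkpoint(path):
    """Return (state, reopened file), both rolled back to the last checkpoint."""
    with open(path, "rb") as f:
        record = pickle.load(f)
    data_file = open(record["file_name"], "r+b")
    # Drop anything written after the checkpoint, then resume at the saved offset.
    data_file.truncate(record["offset"])
    data_file.seek(record["offset"])
    return record["state"], data_file


def demo(workdir):
    """Run three steps, checkpoint, lose some work, then recover."""
    ckpt = os.path.join(workdir, "app.ckpt")
    out = open(os.path.join(workdir, "out.bin"), "w+b")
    state = {"step": 0}
    for step in range(1, 4):
        out.write(b"step%d\n" % step)
        state["step"] = step
    save_checkpoint(ckpt, state, out)
    out.write(b"lost\n")  # work done after the checkpoint, lost in the "failure"
    out.close()
    state, out = restore_checkpoint(ckpt)
    out.seek(0)
    contents = out.read()
    out.close()
    return state, contents
```

Calling `demo` on an empty directory should return the state as of the checkpoint together with the output file contents without the post-checkpoint write. Because everything happens in ordinary user code with standard file operations, nothing here depends on kernel support, which is the usual portability argument for user-level checkpointing.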