Self-Checkpoint: An In-Memory Checkpoint Method Using Less Space and Its Practice on Fault-Tolerant HPL

@inproceedings{Tang2017SelfCheckpointAI,
  title={Self-Checkpoint: An In-Memory Checkpoint Method Using Less Space and Its Practice on Fault-Tolerant HPL},
  author={Xiongchao Tang and Jidong Zhai and Bowen Yu and Wenguang Chen and Weimin Zheng},
  booktitle={PPoPP},
  year={2017}
}
Fault tolerance is increasingly important in high performance computing because system scale is growing substantially while system reliability is decreasing. In-memory (diskless) checkpointing has gained extensive attention as a way to avoid the I/O bottleneck of traditional disk-based checkpoint methods. However, applications using previous in-memory checkpoint methods are left with little available memory. To provide high reliability, previous in-memory checkpoint methods either need to keep two copies…