Leveraging near data processing for high-performance checkpoint/restart

Abhinav Agrawal, Gabriel H. Loh, and James Tuck
As HPC systems grow, the system mean time to interrupt (MTTI) decreases. When checkpoint/restart (C/R) is used for mitigation, each checkpoint must therefore be stored in less time. Multilevel checkpointing improves C/R efficiency by saving most checkpoints to fast compute-node local storage, but it incurs a high cost for writing a few checkpoints to slow global I/O. We show that leveraging near data processing (NDP) to offload the writing of checkpoints to global I/O improves C/R efficiency. We explore…
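
To make the link between a shrinking mean time to interrupt and the need for faster checkpoints concrete, the sketch below uses Young's first-order approximation for the checkpoint interval, a standard textbook model and not necessarily the model used by the authors. The function names and all numbers in the usage lines are hypothetical, and the overhead estimate deliberately ignores restart and lost-work costs.

```python
import math


def young_interval(checkpoint_cost_s: float, mtti_s: float) -> float:
    """Young's first-order approximation of the compute time between checkpoints.

    checkpoint_cost_s: time to write one checkpoint, in seconds (assumed input).
    mtti_s: system mean time to interrupt, in seconds (assumed input).
    """
    return math.sqrt(2.0 * checkpoint_cost_s * mtti_s)


def checkpoint_overhead(checkpoint_cost_s: float, mtti_s: float) -> float:
    """Fraction of wall-clock time spent writing checkpoints at Young's interval.

    Simplification: restart time and recomputed work after a failure are ignored.
    """
    interval = young_interval(checkpoint_cost_s, mtti_s)
    return checkpoint_cost_s / (interval + checkpoint_cost_s)


if __name__ == "__main__":
    # Hypothetical values: a 10-minute write to global I/O versus a 1-minute write,
    # each evaluated at a 24-hour and a 4-hour mean time to interrupt.
    for cost_s in (600.0, 60.0):
        for mtti_h in (24.0, 4.0):
            frac = checkpoint_overhead(cost_s, mtti_h * 3600.0)
            print(f"cost={cost_s:.0f}s  MTTI={mtti_h:.0f}h  overhead={frac:.1%}")
```

Under this simplified model, the overhead fraction rises as the mean time to interrupt falls and drops when the per-checkpoint write cost is reduced, which is the pressure that motivates moving the slow global I/O write off the critical path.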