AI-Ckpt: leveraging memory access patterns for adaptive asynchronous incremental checkpointing

@inproceedings{Nicolae2013AICkptLM,
  title={AI-Ckpt: leveraging memory access patterns for adaptive asynchronous incremental checkpointing},
  author={Bogdan Nicolae and Franck Cappello},
  booktitle={HPDC},
  year={2013}
}
With the increasing scale and complexity of supercomputing and cloud computing architectures, faults are becoming a frequent occurrence, which makes reliability a difficult challenge. Although for some applications it is enough to restart failed tasks, there is a large class of applications where tasks run for a long time or are tightly coupled, making a restart from scratch infeasible. Checkpoint-Restart (CR), the main method to survive failures for such applications, faces additional…
