Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
@article{Moody2010DesignMA,
  title={Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System},
  author={Adam T. Moody and Greg Bronevetsky and Kathryn Mohror and Bronis R. de Supinski},
  journal={2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis},
  year={2010},
  pages={1-11}
}
High-performance computing (HPC) systems are growing more powerful by utilizing more hardware components. As the system mean-time-before-failure correspondingly drops, applications must checkpoint more frequently to make progress. However, as the system memory sizes grow faster than the bandwidth to the parallel file system, the cost of checkpointing begins to dominate application run times. Multi-level checkpointing potentially solves this problem through multiple types of checkpoints with…
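As a rough illustration of the trade-off sketched in the abstract (an illustrative aside, not a formula from this paper), the classic first-order approximation attributed to Young ties the optimal checkpoint interval to the per-checkpoint write cost and the system MTBF:

    % Illustrative only: delta = time to write one checkpoint, M = system MTBF
    \tau_{\mathrm{opt}} \approx \sqrt{2\,\delta M}

For example, a 10-minute checkpoint (delta = 600 s) against a 24-hour MTBF (M = 86400 s) gives an interval of roughly sqrt(2 * 600 * 86400) ≈ 10,200 s, about 2.8 hours; shrinking the MTBF or inflating the checkpoint cost pushes checkpoints closer together, which is exactly the squeeze that multi-level checkpointing is designed to relieve.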
506 Citations
Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System
- Computer Science
- IEEE Transactions on Parallel and Distributed Systems
- 2014
A multilevel checkpointing library, the Scalable Checkpoint/Restart (SCR) library, is presented that writes lightweight checkpoints to node-local storage in addition to the parallel file system, along with probabilistic Markov models of SCR's performance.
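As a hedged sketch of how an application typically drives SCR's checkpoint cycle, assuming the C interface described in the SCR papers (SCR_Init, SCR_Need_checkpoint, SCR_Start_checkpoint, SCR_Route_file, SCR_Complete_checkpoint); the file name, loop structure, and state written below are placeholders, and newer SCR releases may expose a different API:

    /* Minimal sketch of an SCR-driven checkpoint loop (MPI + C).
     * The SCR_* calls are those described in the SCR papers; the
     * file name and application state below are placeholders. */
    #include <mpi.h>
    #include <stdio.h>
    #include "scr.h"

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        SCR_Init();                        /* start SCR after MPI */

        for (int step = 0; step < 100; step++) {
            /* ... one timestep of application work ... */

            int need = 0;
            SCR_Need_checkpoint(&need);    /* ask SCR if it is time to checkpoint */
            if (need) {
                SCR_Start_checkpoint();    /* open a new checkpoint epoch */

                char path[SCR_MAX_FILENAME];
                /* SCR rewrites the name to a path at the chosen level,
                 * typically node-local storage */
                SCR_Route_file("rank.ckpt", path);

                FILE *fp = fopen(path, "w");
                int valid = (fp != NULL);
                if (fp) {
                    fprintf(fp, "step=%d\n", step);   /* placeholder state */
                    fclose(fp);
                }
                SCR_Complete_checkpoint(valid);  /* each rank reports success/failure */
            }
        }

        SCR_Finalize();
        MPI_Finalize();
        return 0;
    }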
Thinking Beyond the RAM Disk for In-Memory Checkpointing of HPC Applications
- Computer Science
- 2013
A novel user-space file system is implemented that stores file data in main memory, transparently spills over to other storage such as the parallel file system as needed, and can be ported to platforms where a RAM disk does not exist.
Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications
- Computer Science
- 2014 IEEE 28th International Parallel and Distributed Processing Symposium
- 2014
A mathematical model is built to fit the multi-level checkpoint/restart mechanism to large-scale applications under various types of failures, and an optimized selection of levels, combined with optimal checkpoint intervals at each level, is shown to outperform other state-of-the-art solutions.
The design and implementation of a multi-level content-addressable checkpoint file system
- Computer Science
- 2012 19th International Conference on High Performance Computing
- 2012
Cento, a multi-level, content-addressable checkpoint file system for large-scale HPC systems, is described; it achieves in-flight checkpoint data reduction across all compute nodes through compression and the elimination of duplicate blocks over a series of checkpoints.
Accelerating incremental checkpointing for extreme-scale computing
- Computer Science
- Future Gener. Comput. Syst.
- 2014
Design and modeling of a non-blocking checkpointing system
- Computer Science
- 2012 International Conference for High Performance Computing, Networking, Storage and Analysis
- 2012
The design of the system is presented; it can improve efficiency by 1.1x to 2.0x on future machines, and applications using the checkpointing system can achieve high efficiency even when using a PFS with lower bandwidth.
A 1 PB/s file system to checkpoint three million MPI tasks
- Computer Science
- HPDC
- 2013
A novel user-space file system stores data in main memory and transparently spills over to other storage, such as local flash memory or the parallel file system, as needed, extending the reach of libraries like SCR to systems where they otherwise could not be used.
Fault-tolerance for exascale systems
- Computer Science
- 2010 IEEE International Conference On Cluster Computing Workshops and Posters (CLUSTER WORKSHOPS)
- 2010
This work compares the performance characteristics of uncoordinated CR with message logging, redundant computation, and RAID-inspired, in-memory distributed checkpointing schemes for HPC application patterns on a number of proposed exascale machines to provide valuable guidance on the most efficient resilience methods.
An Analysis of Multilevel Checkpoint Performance Models
- Computer Science
- 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
- 2018
This work presents a novel execution time prediction model that takes into consideration execution events not considered by previous multilevel checkpointing models, shows how the model can be used to select checkpoint intervals, and demonstrates why accounting for these events is important.
An Evaluation of Different I/O Techniques for Checkpoint/Restart
- Computer Science
- 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum
- 2013
This paper presents a checkpointing technique that significantly reduces the checkpoint overhead and is highly scalable, and shows the approach to have marginal overhead, as opposed to standard synchronous checkpointing, for typical application scenarios.
References
Showing 1-10 of 35 references
Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System
- Computer Science
- 2010
The goal is to design light-weight checkpoints that handle the most common failure modes, to rely on more expensive checkpoints for less common but more severe failures, and to develop low-cost checkpoint schemes that are 100x-1000x faster than the parallel file system and effective against 85% of system failures.
Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments
- Computer Science
- 2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID)
- 2008
A model is built that aims to reduce full-checkpoint overhead by performing a set of incremental checkpoints between two consecutive full checkpoints, and a method is given to determine the number of those incremental checkpoints.
Using two-level stable storage for efficient checkpointing
- Computer Science
- IEE Proc. Softw.
- 1998
A two-level stable storage that integrates neighbour-based with disk-based checkpointing is proposed, combining the advantages of the two schemes: the efficiency of diskless checkpointing with the high reliability of disk-based checkpointing.
A case for two-level distributed recovery schemes
- Computer Science
- SIGMETRICS '95/PERFORMANCE '95
- 1995
This paper demonstrates that it is often advantageous to use "two-level" recovery schemes, which tolerate the more probable failures with low performance overhead, while the less probable failures may be tolerated with a higher overhead.
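A minimal illustration of that argument (an assumption-laden sketch, not the paper's actual analysis): if a fraction p of failures can be recovered from the cheap first-level checkpoint at cost r_1 and the remainder fall back to the expensive second level at cost r_2 > r_1, the expected recovery cost per failure is

    E[R] = p\,r_1 + (1 - p)\,r_2

which stays close to r_1 whenever the cheap level covers most failures (p near 1); for example, p = 0.85, r_1 = 30 s, r_2 = 600 s gives E[R] = 0.85*30 + 0.15*600 = 115.5 s, versus 600 s for a scheme that always recovers from the expensive level.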
Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems
- Computer Science
- J. Parallel Distributed Comput.
- 2001
This paper presents a performance model for long-running parallel computations that execute with checkpointing enabled, discusses how it is relevant to today's parallel computing environments and software, and presents case studies of using the model to select runtime parameters.
Faster checkpointing with N+1 parity
- Computer Science
- Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing
- 1994
A basic algorithm is described that uses two extra processors for checkpointing and enables the system to tolerate any single processor failure; the algorithm's speed comes from a combination of N+1 parity, extra physical memory, and virtual memory hardware, so that checkpoints need not be written to disk.
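As a hedged sketch of the core N+1 parity idea (a simplified illustration with a single parity process and placeholder buffer sizes, not the authors' two-extra-processor algorithm): each compute rank XORs its in-memory checkpoint buffer into a parity buffer held by one extra rank, so any single lost buffer can later be rebuilt by XORing the parity with the surviving buffers.

    /* Illustrative N+1 parity checkpoint sketch (not the paper's algorithm):
     * ranks 0..N-1 hold in-memory checkpoint buffers; the last rank collects
     * their bitwise XOR as a parity buffer. Any single lost data buffer can be
     * reconstructed by XORing the parity with the N-1 surviving buffers. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    #define CKPT_BYTES (1 << 20)   /* assumed 1 MiB checkpoint per rank */

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int parity_rank = size - 1;                 /* the "+1" process */
        unsigned char *ckpt = calloc(CKPT_BYTES, 1);
        if (rank != parity_rank)
            memset(ckpt, rank + 1, CKPT_BYTES);     /* placeholder app state */

        /* The parity rank contributes zeros, so the XOR reduction yields the
         * parity of the N data buffers and deposits it on the parity rank. */
        unsigned char *parity = (rank == parity_rank) ? calloc(CKPT_BYTES, 1) : NULL;
        MPI_Reduce(ckpt, parity, CKPT_BYTES, MPI_UNSIGNED_CHAR,
                   MPI_BXOR, parity_rank, MPI_COMM_WORLD);

        /* Recovery (conceptually): the lost buffer equals the parity XORed
         * with all surviving data buffers, computable with the same kind
         * of reduction over the remaining ranks. */

        free(ckpt);
        free(parity);               /* free(NULL) is a no-op on data ranks */
        MPI_Finalize();
        return 0;
    }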
A Case of Multi-Level Distributed Recovery Schemes
- Computer Science
- 2001
The objective of this report is to motivate research into recovery schemes that can provide multiple levels of fault tolerance, and analyze a hypothetical 2-level recovery scheme that takes two different types of checkpoints, namely, 1-checkpoints and N-checkpoints.
Fault tolerant high performance computing by a coding approach
- Computer Science
- PPOPP
- 2005
Experimental results demonstrate that the proposed floating-point arithmetic coding approach is able to survive a small number of simultaneous node failures with low performance overhead and little numerical impact.
Zest: Checkpoint storage system for large supercomputers
- Computer Science
- 2008 3rd Petascale Data Storage Workshop
- 2008
The PSC has developed a prototype distributed file system infrastructure that vastly accelerates aggregated write bandwidth on large compute platforms and has prototyped a scalable solution that will be directly applicable to future petascale compute platforms with on the order of 10^6 cores.
Cooperative checkpointing: a robust approach to large-scale systems reliability
- Computer Science
- ICS '06
- 2006
A simulation-based experimental analysis reveals that cooperative checkpointing enables an application to make progress under a wide variety of failure distributions that periodic checkpointing lacks the flexibility to handle.