Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System

@article{Moody2010DesignMA,
  title={Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System},
  author={Adam T. Moody and Greg Bronevetsky and Kathryn Mohror and Bronis R. de Supinski},
  journal={2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis},
  year={2010},
  pages={1--11}
}
  • A. Moody, G. Bronevetsky, K. Mohror, B. de Supinski
  • Published 13 November 2010
  • Computer Science
  • 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
High-performance computing (HPC) systems are growing more powerful by utilizing more hardware components. As the system mean-time-before-failure correspondingly drops, applications must checkpoint more frequently to make progress. However, as the system memory sizes grow faster than the bandwidth to the parallel file system, the cost of checkpointing begins to dominate application run times. Multi-level checkpointing potentially solves this problem through multiple types of checkpoints with… 

Detailed Modeling and Evaluation of a Scalable Multilevel Checkpointing System
TLDR
A multilevel checkpointing library, the Scalable Checkpoint/Restart (SCR) library, that writes lightweight checkpoints to node-local storage in addition to the parallel file system, and probabilistic Markov models of SCR's performance are presented.
Thinking Beyond the RAM Disk for In-Memory Checkpointing of HPC Applications
TLDR
A novel user-space file system that stores file data in main memory and transparently spills over to other storage like the parallel file system as needed and can be ported to platforms where RAM disk does not exist is implemented.
Optimization of Multi-level Checkpoint Model for Large Scale HPC Applications
TLDR
A mathematical model is built to fit the multi-level checkpoint/restart mechanism to large-scale applications under various types of failures, and the optimized selection of levels, together with optimal checkpoint intervals at each level, outperforms other state-of-the-art solutions.
The design and implementation of a multi-level content-addressable checkpoint file system
TLDR
Cento is described, a multi-level, content-addressable checkpoint file system for large-scale HPC systems that achieves in-flight checkpoint data reduction across all compute nodes through compression and elimination of duplicate blocks over a series of checkpoints.
Accelerating incremental checkpointing for extreme-scale computing
Design and modeling of a non-blocking checkpointing system
  • Kento Sato, N. Maruyama, S. Matsuoka
  • Computer Science
    2012 International Conference for High Performance Computing, Networking, Storage and Analysis
  • 2012
TLDR
The design of the system is presented; the system can improve efficiency by 1.1 to 2.0x on future machines, and applications using it can achieve high efficiency even when using a PFS with lower bandwidth.
A 1 PB/s file system to checkpoint three million MPI tasks
TLDR
A novel user-space file system that stores data in main memory and transparently spills over to other storage, like local flash memory or the parallel file system, as needed, which extends the reach of libraries like SCR to systems where they otherwise could not be used.
Fault-tolerance for exascale systems
TLDR
This work compares the performance characteristics of uncoordinated CR with message logging, redundant computation, and RAID-inspired, in-memory distributed checkpointing schemes for HPC application patterns on a number of proposed exascale machines to provide valuable guidance on the most efficient resilience methods.
An Analysis of Multilevel Checkpoint Performance Models
TLDR
This work presents a novel execution time prediction model that takes into consideration execution events that have not been considered by previous multilevel checkpointing models and shows how this model can be used to select checkpoint intervals and demonstrates why consideration of these execution events is important.
An Evaluation of Different I/O Techniques for Checkpoint/Restart
TLDR
This paper presents a checkpointing technique that significantly reduces checkpoint overhead and is highly scalable, and shows that the approach has marginal overhead, as opposed to standard synchronous checkpointing, for typical application scenarios.

References

SHOWING 1-10 OF 35 REFERENCES
Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System
TLDR
The goal is to design light-weight checkpoints to handle the most common failure modes and rely on more expensive checkpoints for less common, but more severe failures, and to develop low-cost checkpoint schemes that are 100x-1000x faster than the parallel file system and effective against 85% of system failures.
Reliability-Aware Approach: An Incremental Checkpoint/Restart Model in HPC Environments
TLDR
A model that aims to reduce full checkpoint overhead by performing a set of incremental checkpoints between two consecutive full checkpoints is built and a method to find the number of those incremental checkpoints is given.
Using two-level stable storage for efficient checkpointing
TLDR
A two-level stable storage scheme integrating neighbour-based with disk-based checkpointing is proposed, which combines the advantages of the two schemes: the efficiency of diskless checkpointing with the high reliability of disk-based checkpointing.
A case for two-level distributed recovery schemes
  • N. Vaidya
  • Computer Science
    SIGMETRICS '95/PERFORMANCE '95
  • 1995
TLDR
This paper demonstrates that it is often advantageous to use "two-level" recovery schemes, which tolerate the more probable failures with low performance overhead, while the less probable failures may be tolerated with a higher overhead.
Processor Allocation and Checkpoint Interval Selection in Cluster Computing Systems
TLDR
This paper presents a performance model for long-running parallel computations that execute with checkpointing enabled, discusses how it is relevant to today's parallel computing environments and software, and presents case studies of using the model to select runtime parameters.
Faster checkpointing with N+1 parity
  • J. Plank, Kai Li
  • Computer Science
    Proceedings of IEEE 24th International Symposium on Fault-Tolerant Computing
  • 1994
TLDR
A basic algorithm is described that uses two extra processors for checkpointing and enables the system to tolerate any single processor failure and the algorithm's speed comes from a combination of N+1 parity, extra physical memory, and virtual memory hardware so that checkpoints need not be written to disk.
A Case of Multi-Level Distributed Recovery Schemes
TLDR
The objective of this report is to motivate research into recovery schemes that can provide multiple levels of fault tolerance, and to analyze a hypothetical 2-level recovery scheme that takes two different types of checkpoints, namely, 1-checkpoints and N-checkpoints.
Fault tolerant high performance computing by a coding approach
TLDR
Experimental results demonstrate that the proposed floating-point arithmetic coding approach is able to survive a small number of simultaneous node failures with low performance overhead and little numerical impact.
Zest Checkpoint storage system for large supercomputers
TLDR
The PSC has developed a prototype distributed file system infrastructure that vastly accelerates aggregated write bandwidth on large compute platforms and prototyped a scalable solution that will be directly applicable to future petascale compute platforms with on the order of 10^6 cores.
Cooperative checkpointing: a robust approach to large-scale systems reliability
TLDR
A simulation-based experimental analysis reveals that cooperative checkpointing enables an application to make progress under a wide variety of failure distributions that periodic checkpointing lacks the flexibility to handle.