Checkpointing Exascale Memory Systems with Existing Memory Technologies

Authors: Nilmini Abeyratne, Hsing Min Chen, Byoungchan Oh, Ronald G. Dreslinski, Chaitali Chakrabarti, Trevor N. Mudge
Published in: Proceedings of the Second International Symposium on Memory Systems
Building exascale supercomputers requires resilience to failing components such as processor, memory, storage, and network devices. Checkpoint/restart is a key ingredient in attaining resilience, but providing fast and reliable checkpointing is becoming more challenging as the amount of data to checkpoint and the number of components that can fail increase in exascale systems. To improve the speed of checkpointing, emerging non-volatile memories (phase change, magnetic, resistive RAM) have been…
Efficient Fault Tolerance Through Dynamic Node Replacement
A dynamic node replacement algorithm is proposed that finds replacement nodes by exploiting the flexibility of moldable and malleable jobs. It can maintain high throughput even when a system experiences frequent node failures, which makes it well suited to complement multi-level checkpointing.
Reliability Issues in the Parallel Dataflow Computing System
Traditional and dataflow computing models are compared, and one variant of a system recovery mechanism for a fault or failure of a computational core is presented, using an error in the execution unit as an example.
Extending Message Passing Interface Windows to Storage
Initial performance results suggest that the presented MPI window extension could help a wide range of use cases with low overhead.
Studies in Exascale Computer Architecture: Interconnect, Resiliency, and Checkpointing
This document summarizes current capabilities, research and operational priorities, and plans for further studies that were established at the 2015 USGS workshop on quantitative hazard assessments of earthquake-triggered landsliding and liquefaction in the Czech Republic.
Persistent coarrays: integrating MPI storage windows in coarray fortran
Persistent coarrays are proposed: an extension of OpenCoarrays that integrates MPI storage windows to leverage its transport layer and seamlessly map coarrays to files on storage. The approach provides clear benefits on representative workloads while requiring minimal source code changes.


Optimizing Checkpoints Using NVM as Virtual Memory
NVM-checkpoints reduce the NVM and interconnect bandwidth used through a novel pre-copy mechanism, which incrementally moves checkpoint data from DRAM to NVM before a local checkpoint is started. This results in 40% faster application execution times compared to asynchronous approaches that do not use pre-copying.
Thinking Beyond the RAM Disk for In-Memory Checkpointing of HPC Applications
With the massive growth in scale and complexity of high performance computing (HPC) systems, long-running scientific parallel applications periodically save the state of their execution to files…
Using multi-level cell STT-RAM for fast and energy-efficient local checkpointing
The experimental results show that the average performance overhead is less than 1% in a multi-programmed four-core process node with a 1-second local checkpoint interval, and the evaluation results demonstrate that using MLC STT-RAM is an energy-efficient solution.
Enhancing Checkpoint Performance with Staging IO and SSD
A new strategy is proposed to enhance checkpoint writing performance by aggregating checkpoint writes at the client side and using staging IO on data servers. It achieves up to 6.3 times higher write bandwidth than PVFS2, a popular parallel file system, with 8 client nodes and 4 data servers.
Energy-aware I/O optimization for checkpoint and restart on a NAND flash memory system
Performance studies show that the profile lookup approach can save 4.1% of energy consumption in an application execution with checkpoint/restart, and that it improves the energy consumption of write operations by 67.4% and of read operations by 40.2% on a PCIe-attached NAND flash memory device.
Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems
This work uses the upcoming Phase-Change Random Access Memory (PCRAM) technology and, after a thorough analysis of MPP system failure rates and failure sources, proposes a hybrid local/global checkpointing mechanism to reduce the checkpoint overhead and offer a smooth transition from the conventional pure HDD checkpoint to the ideal 3D PCRAM mechanism.
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
The Scalable Checkpoint/Restart (SCR) library is presented, a multi-level checkpoint system that writes checkpoints to RAM, Flash, or disk on the compute nodes in addition to the parallel file system. It improves efficiency on existing large-scale systems, and this benefit increases as the system size grows.
A 1 PB/s file system to checkpoint three million MPI tasks
A novel user-space file system is presented that stores data in main memory and transparently spills over to other storage, such as local flash memory or the parallel file system, as needed. This extends the reach of libraries like SCR to systems where they otherwise could not be used.
Distributed Diskless Checkpoint for Large Scale Systems
This work proposes a fault-tolerant model able to tolerate up to 50% of process failures with a low checkpointing overhead. It uses solid-state drives to significantly increase checkpoint performance and avoid the memory overhead of classic diskless checkpointing.
FTC-Charm++: an in-memory checkpoint-based fault tolerant runtime for Charm++ and MPI
  • G. Zheng, Lixia Shi, L. Kalé
  • Computer Science
  • 2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935)
  • 2004
FTC-Charm++ is presented, a fault-tolerant runtime based on a scheme for fast and scalable in-memory checkpoint and restart. The scheme is useful for applications whose memory footprint is small at the checkpoint state; a variation of it, in-disk checkpoint/restart, can be applied to applications with a large memory footprint.