ReStore: In-Memory REplicated STORagE for Rapid Recovery in Fault-Tolerant Algorithms

@article{Hespe2022ReStoreIR,
  title={ReStore: In-Memory REplicated STORagE for Rapid Recovery in Fault-Tolerant Algorithms},
  author={Demian Hespe and Lukas H{\"u}bner and Peter Sanders and Alexandros Stamatakis},
  journal={ArXiv},
  year={2022},
  volume={abs/2203.01107}
}
Fault-tolerant distributed applications require mechanisms to recover data lost via a process failure. On modern cluster systems it is typically impractical to request replacement resources after such a failure. Therefore, applications have to continue working with the remaining resources. This requires redistributing the workload and having the non-failed processes reload the lost data. We present an algorithmic framework and its C++ library implementation ReStore for MPI programs that enables…

References

Scalable diskless checkpointing for large parallel systems
TLDR
A diskless checkpointing and recovery system is implemented and the results show much greater I/O scalability and higher throughput than disk-based parallel file systems for a large number of clients.
Shrink or Substitute: Handling Process Failures in HPC Systems Using In-Situ Recovery
TLDR
This paper explores the use of user-level failure mitigation (ULFM), a set of fault tolerance extensions to the Message Passing Interface (MPI), for handling process failures without discarding the progress made by the application, and demonstrates that graceful degradation is a viable alternative for recovery in environments where spares may not be available.
Exploring Automatic, Online Failure Recovery for Scientific Applications at Extreme Scales
TLDR
Fenix is a framework for enabling recovery from process/node/blade/cabinet failures for MPI-based parallel applications in an online and transparent manner; it relies on application-driven, diskless, implicitly coordinated checkpointing.
Fault tolerance for remote memory access programming models
TLDR
This paper designs a model for reasoning about fault tolerance for RMA, and uses this model to construct several highly-scalable mechanisms that provide efficient low-overhead in-memory checkpointing, transparent logging of remote memory accesses, and a scheme for transparent recovery of failed processes.
Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System
TLDR
The Scalable Checkpoint/Restart (SCR) library is a multi-level checkpoint system that writes checkpoints to RAM, Flash, or disk on the compute nodes in addition to the parallel file system. It improves efficiency on existing large-scale systems, and this benefit increases as the system size grows.
A scalable and extensible checkpointing scheme for massively parallel simulations
TLDR
A scalable, distributed, diskless, and resilient checkpointing scheme that can create and recover snapshots of a partitioned simulation domain and is fully integrated in a state-of-the-art high-performance multi-physics simulation framework.
Checkpointing Strategies for Shared High-Performance Computing Platforms
TLDR
This work considers different aspects (system-level scheduling policies and hardware) that optimize the overall performance of concurrently executing checkpoint/restart (CR)-based applications that share I/O resources, and shows that combining optimal checkpointing periods with contention-aware system-level I/O scheduling strategies can significantly improve overall application performance and maximize platform throughput.
A diskless checkpointing algorithm for super-scale architectures applied to the fast Fourier transform
  • C. Engelmann, A. Geist
  • Computer Science
    Proceedings of the International Workshop on Challenges of Large Applications in Distributed Environments, 2003.
  • 2003
TLDR
This paper adapts the present technique of diskless checkpointing to such huge distributed systems in order to equip existing scientific algorithms with super-scalable fault tolerance, and presents results from an implementation of the Fast Fourier Transform that uses the adapted technique to achieve super-scale fault tolerance.
VeloC: Towards High Performance Adaptive Asynchronous Checkpointing at Large Scale
TLDR
A concurrency-optimized technique that combines performance modeling with lightweight monitoring to make informed decisions about what local storage devices to use in order to dynamically adapt to background flushes and reduce the checkpointing overhead is introduced.
Post-failure recovery of MPI communication capability
TLDR
This paper presents a set of extensions to MPI that allow communication capabilities to be restored, while maintaining the extreme level of performance to which MPI users have become accustomed.
...