End-to-End Resilience for HPC Applications

Arash Rezaei, Harsh Khetawat, Onkar Patil, Frank Mueller, Paul H. Hargrove, Eric Roman
A plethora of resilience techniques have been investigated to protect application kernels. If, however, such techniques are combined and they interact across kernels, new vulnerability windows are created. This work contributes the idea of end-to-end resilience by protecting windows of vulnerability between kernels guarded by different resilience techniques. It introduces the live vulnerability factor (LVF), a new metric that quantifies any lack of end-to-end protection for a given data… 
3 Citations

TeaMPI—Replication-Based Resilience Without the (Performance) Pain

This work employs a weakly consistent data model where replicas run independently yet inform each other through heartbeat messages whether they are still up and running, and introduces a novel algorithmic idea where replication reduces the time-to-solution.

High Performance Computing: 35th International Conference, ISC High Performance 2020, Frankfurt/Main, Germany, June 22–25, 2020, Proceedings

This work develops FASTHash, a “truly” high-throughput parallel hash table implementation using FPGA on-chip SRAM, and provides a theoretical worst-case bound on the number of erroneous queries (true negative search, duplicate inserts) due to relaxed eventual consistency.

Coded QR Decomposition

A condition is proved for a checksum-generator matrix to restore the degraded orthogonality of the decoded Q through low-cost post-processing, and a checksum-generator matrix for single-node failures is constructed.



Quantitatively Modeling Application Resilience with the Data Vulnerability Factor

This paper introduces a data-driven, practical methodology to analyze these application vulnerabilities using a novel resilience metric: the data vulnerability factor (DVF), which integrates knowledge from both the application and target hardware into the calculation.

Pragma-Controlled Source-to-Source Code Transformations for Robust Application Execution

Preliminary results of the use of a subset of pragma directives for a simple implementation of the conjugate-gradient numerical solver in the presence of uncorrected memory errors are presented, showing that it is possible to implement simple recovery strategies with very low programmer effort and execution time overhead.

Software-based dynamic reliability management for GPU applications

This work presents a flexible, automated software-based DRM framework that can provide an adaptable, cost-effective approach to scaling the reliability of large systems and that guides selective injection of code implementing SRE techniques to protect the most vulnerable data.

Design for a Soft Error Resilient Dynamic Task-Based Runtime

This paper explores three additions to a dynamic task-based runtime to build a generic framework providing soft error resilience to task-based programming paradigms, and demonstrates the overhead introduced by such mechanisms.

File I/O for MPI Applications in Redundant Execution Scenarios

  • S. Böhm, C. Engelmann
  • Computer Science
    2012 20th Euromicro International Conference on Parallel, Distributed and Network-based Processing
  • 2012
The results show the performance impact for redundantly accessing a shared networked file system, but also demonstrate the capability to regain performance by utilizing MPI communication between replicas and parallel file I/O.

Fault Tolerance in an Inner-Outer Solver: A GVR-Enabled Case Study

It is shown that single bit flip errors in the Flexible GMRES iterative linear solver can lead to high computational overhead or even failure to converge to the right answer, and several strategies for fault tolerance in both inner and outer solvers appropriate across a range of error rates are designed.

Improving Application Resilience through Probabilistic Task Replication

It is demonstrated that the resilience index can help to better define the tradeoffs for the designers of future systems and developers of parallel software.

Detailed Modeling, Design, and Evaluation of a Scalable Multi-level Checkpointing System

The goal is to design light-weight checkpoints to handle the most common failure modes and rely on more expensive checkpoints for less common, but more severe failures, and to develop low-cost checkpoint schemes that are 100x-1000x faster than the parallel file system and effective against 85% of system failures.

Eliminating microarchitectural dependency from Architectural Vulnerability

  • Vilas Sridharan, D. Kaeli
  • Computer Science
    2009 IEEE 15th International Symposium on High Performance Computer Architecture
  • 2009
This work demonstrates that the new Program Vulnerability Factor (PVF) metric provides such a basis: PVF captures the architecture-level fault masking inherent in a program, allowing software designers to make quantitative statements about a program's tolerance to soft errors.

Design, Modeling, and Evaluation of a Scalable Multi-level Checkpointing System

This work designs the Scalable Checkpoint/Restart (SCR) library, a multi-level checkpoint system that writes checkpoints to RAM, Flash, or disk on the compute nodes in addition to the parallel file system. It improves efficiency on existing large-scale systems, and this benefit increases as the system size grows.