Exploring Partial Replication to Improve Lightweight Silent Data Corruption Detection for HPC Applications

Eduardo Berrocal, Leonardo Arturo Bautista-Gomez, Sheng Di, Zhiling Lan, Franck Cappello
Silent data corruption (SDC) poses a great challenge for high-performance computing (HPC) applications as we move to extreme-scale systems. […] In this work, we propose partial replication to overcome this limitation. More specifically, we have observed that not all processes of an MPI application experience the same level of data variability at exactly the same time. Thus, we can smartly choose and replicate only those processes for which our lightweight data-analytic detectors would perform poorly…
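
The selection idea in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: the variability metric (mean absolute change between two time steps), the fixed replication budget, and the names `variability` and `choose_replicas` are hypothetical and are not taken from the paper.

```python
def variability(prev, curr):
    """Mean absolute change of one process's data between two time steps.

    Assumed metric, used here only as a stand-in for "data variability".
    """
    return sum(abs(c - p) for p, c in zip(prev, curr)) / len(curr)


def choose_replicas(per_rank_prev, per_rank_curr, budget):
    """Return the ranks with the highest variability, at most `budget` of them.

    High variability is taken as a proxy for "the lightweight detector
    would perform poorly here" (an assumption for this sketch).
    """
    scores = {
        rank: variability(per_rank_prev[rank], per_rank_curr[rank])
        for rank in per_rank_curr
    }
    ranked = sorted(scores, key=lambda r: scores[r], reverse=True)
    return sorted(ranked[:budget])


if __name__ == "__main__":
    prev = {0: [1.0, 1.0], 1: [1.0, 1.0], 2: [1.0, 1.0]}
    curr = {0: [1.0, 1.1], 1: [3.0, 0.0], 2: [1.5, 1.5]}
    print(choose_replicas(prev, curr, budget=2))
```

Ranks that are not selected would rely on the lightweight detector alone; only the selected ranks pay the cost of replication.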

Toward General Software Level Silent Data Corruption Detection for Parallel Applications

This work proposes partial replication to overcome the limitation that not all processes of an MPI application experience the same level of data variability at exactly the same time, and proposes a new evaluation method based on the probability that a corruption will pass unnoticed by a particular detector.

Efficient detection of silent data corruption in HPC applications with synchronization-free message verification

This paper proposes the first SDC detection solution that requires neither synchronization nor additional message transmission between replicas. It combines message logging with an innovative asynchronous message comparison mechanism that uses specialized service routines to compare progress without interfering with target program execution.

Evaluating the Viability of Using Compression to Mitigate Silent Corruption of Read-Mostly Application Data

This paper proposes an application-independent mechanism to efficiently detect and correct silent data corruption (SDC) in read-mostly memory, where SDC may be most likely to occur, and uses memory protection mechanisms to maintain compressed backups of application memory.

Hardening Strategies for HPC Applications

This work presents and discusses radiation experiments that cover a total of more than 352,000 years of natural exposure and fault-injection analysis, and proposes and analyzes the impact of selective hardening for HPC algorithms.

Experimental and analytical study of Xeon Phi reliability

An in-depth analysis of the effects of transient faults on HPC applications running on Intel Xeon Phi processors, based on radiation experiments and high-level fault injection, is presented, and it is shown that portions of applications can be graded by different criticalities.

Scaling and Resilience in Numerical Algorithms for Exascale Computing

A new adaptive scheduler is proposed that optimizes the parallel efficiency by minimizing the time-subdomain length without making communication of time-subdomains too costly, and it is demonstrated that it is possible to obtain time-parallel speedup on the nonlinear shallow water equation, beyond what is possible using conventional spatial domain-decomposition techniques alone.

Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators

It is shown that arithmetic operations are less critical for the K40, while Xeon Phi is more reliable when executing particle interactions solved through Finite Difference Methods, and iterative stencil operations seem the most reliable on both architectures.

User-level failure detection and auto-recovery of parallel programs in HPC systems

This paper proposes an innovative method for the unprivileged user to achieve failure detection of job execution and automatic resubmission of failed jobs, and implements the proposed method as a tool named automatic re-launcher (ARL) and evaluates it on the Tianhe-2 system.

Detection of Silent Data Corruptions in Smoothed Particle Hydrodynamics Simulations

Selective particle replication (SPR) is proposed herein as the first particle-based replication method for detecting SDCs in Smoothed particle hydrodynamics (SPH) simulations, a mesh-free Lagrangian method commonly used to perform hydrodynamic simulations in astrophysics and computational fluid dynamics.

Resiliency in numerical algorithm design for extreme scale simulations

A broad range of perspectives are gathered on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations to discuss novel ways to make applications resilient against detected and undetected faults.



Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications

A pointwise detection model with two phases: one involving the prediction of the next expected value in the time series for each data point, and another determining a range surrounding the predicted next-step value.
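
A minimal sketch of such a two-phase pointwise detector is given below. The summary does not say which predictor or range is used, so linear extrapolation from the two previous time steps and a fixed `radius` are assumptions made here for illustration; the function names are hypothetical.

```python
def predict_next(prev2, prev1):
    """Phase 1: predict the next value of one data point.

    Linear extrapolation from the two previous steps (assumed predictor).
    """
    return 2.0 * prev1 - prev2


def is_suspicious(prev2, prev1, observed, radius):
    """Phase 2: flag the value if it lies outside [pred - radius, pred + radius]."""
    predicted = predict_next(prev2, prev1)
    return abs(observed - predicted) > radius


def scan(step_minus_2, step_minus_1, current, radius):
    """Return the indices of data points whose current value is out of range."""
    return [
        i
        for i, (a, b, c) in enumerate(zip(step_minus_2, step_minus_1, current))
        if is_suspicious(a, b, c, radius)
    ]


if __name__ == "__main__":
    # The second point jumps far from its smooth trend and is reported.
    print(scan([1.0, 2.0, 3.0], [1.1, 2.1, 3.1], [1.2, 9.0, 3.2], radius=0.05))
```

A wider `radius` lowers false positives on rapidly changing data but lets smaller corruptions pass unnoticed, which is the trade-off that motivates combining such detectors with replication.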

Detecting silent data corruption through data dynamic monitoring for scientific applications

A novel technique to detect silent data corruption based on data monitoring is proposed and it is shown that this technique can detect up to 50% of injected errors while incurring only negligible overhead.

Detection and correction of silent data corruption for large-scale high-performance computing

David Fiala. 2012 International Conference for High Performance Computing, Networking, Storage and Analysis, 2012.
This paper studies the potential for redundancy to detect and correct soft errors in MPI message-passing applications while investigating the challenges inherent to detecting soft errors within MPI applications by providing transparent MPI redundancy.

Processor-Level Selective Replication

A processor-level technique called selective replication, by which the application can choose where in its application stream and to what degree it requires replication, is proposed, which shows that with about 59% less overhead than full duplication, selective replication detects 97% of the data errors and 87% of the instruction errors that were covered by full duplication.

Programmer-directed partial redundancy for resilient HPC

This work introduces a programmer-directed selective replication mechanism to provide fault tolerance while decreasing costs, and shows that this scheme detects and corrects around 65% of SDC errors with only 4% overhead.

Proactive process-level live migration in HPC environments

A novel process-level live migration mechanism supports continued execution of applications during much of process migration and is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs.

Detecting and Correcting Data Corruption in Stencil Applications through Multivariate Interpolation

This paper exploits multivariate interpolation in order to detect and correct data corruption in stencil applications and demonstrates that this mechanism can detect and correct the most important corruptions and keep the error deviation under 1% during the entire execution while injecting one corruption per minute.

Opportunistic application-level fault detection through adaptive redundant multithreading

This paper presents an application-level fault detection approach based on adaptive redundant multithreading, built on flexible building blocks for application-specific fault detection, which makes possible more reasonable performance overheads than full redundancy.

FTI: High performance Fault Tolerance Interface for hybrid systems

This work proposes a low-overhead high-frequency multi-level checkpoint technique in which a highly-reliable topology-aware Reed-Solomon encoding in a three-level checkpoint scheme is integrated in the Fault Tolerance Interface (FTI).

A Practical Approach for Handling Soft Errors in Iterative Applications

It is shown that changes in the value of the residue can serve as a signature that detects the soft errors that can have the most negative impact on the applications, and partial replication is proposed to improve accuracy without very large overheads.
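
The residue-as-signature idea can be illustrated with a short sketch. It assumes that the iterative solver's residue normally shrinks from one iteration to the next, and it flags any iteration where the residue grows by more than a chosen factor. The `factor` threshold and the name `flag_residue_jumps` are hypothetical and not taken from the paper.

```python
def flag_residue_jumps(residues, factor=1.5):
    """Return iteration indices where the residue grows by more than `factor`.

    Assumption for this sketch: in a healthy run the residue decreases or
    stays roughly flat, so a sharp increase is treated as a possible soft error.
    """
    flagged = []
    for i in range(1, len(residues)):
        previous, current = residues[i - 1], residues[i]
        if current > factor * previous:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    # Iteration 3 shows a sudden jump in the residue and is reported.
    print(flag_residue_jumps([1.0, 0.5, 0.25, 4.0, 0.1]))
```

Errors that do not visibly disturb the residue would slip past a check like this, which is where a complementary mechanism such as partial replication becomes relevant.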