Exploring Partial Replication to Improve Lightweight Silent Data Corruption Detection for HPC Applications
@inproceedings{Berrocal2016ExploringPR,
  title={Exploring Partial Replication to Improve Lightweight Silent Data Corruption Detection for HPC Applications},
  author={Eduardo Berrocal and Leonardo Arturo Bautista-Gomez and Sheng Di and Zhiling Lan and Franck Cappello},
  booktitle={Euro-Par},
  year={2016}
}
Silent data corruption (SDC) poses a great challenge for high-performance computing (HPC) applications as we move to extreme-scale systems. […] In this work, we propose partial replication to overcome this limitation. More specifically, we have observed that not all processes of an MPI application experience the same level of data variability at exactly the same time. Thus, we can smartly choose and replicate only those processes for which our lightweight data-analytic detectors would perform poorly…
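The selection idea in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: the variability score (mean absolute step-to-step change divided by the value range), the fixed replication budget, and the names `variability`, `choose_replicated_ranks`, and `history` are all assumptions made for illustration. It only shows the general shape of "replicate the processes whose data is hardest for a lightweight detector to predict".

```python
from statistics import mean


def variability(series):
    """Hypothetical score: mean absolute step-to-step change, scaled by the value range."""
    if len(series) < 2:
        return 0.0
    span = max(series) - min(series)
    if span == 0:
        return 0.0  # constant data is trivially predictable
    steps = [abs(b - a) for a, b in zip(series, series[1:])]
    return mean(steps) / span


def choose_replicated_ranks(per_rank_series, budget):
    """Return the `budget` ranks with the highest variability score, sorted by rank id."""
    ranked = sorted(
        per_rank_series,
        key=lambda rank: variability(per_rank_series[rank]),
        reverse=True,
    )
    return sorted(ranked[:budget])


# Illustrative per-rank samples of one monitored variable (made-up values).
history = {
    0: [1.0, 1.0, 1.0, 1.0],  # constant
    1: [0.0, 1.0, 2.0, 3.0],  # smooth growth
    2: [0.0, 3.0, 0.0, 3.0],  # oscillating, hardest to predict
}

print(choose_replicated_ranks(history, budget=1))  # [2]
```

In this toy setting only rank 2 would be replicated, while ranks 0 and 1 would be left to the lightweight detector. A real scheme would also have to re-evaluate the choice over time, since the abstract notes that variability differs across processes at different moments.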
13 Citations
Toward General Software Level Silent Data Corruption Detection for Parallel Applications
- Computer Science
- IEEE Transactions on Parallel and Distributed Systems
- 2017
This work proposes partial replication, motivated by the observation that not all processes of an MPI application experience the same level of data variability at exactly the same time, and proposes a new evaluation method based on the probability that a corruption will pass unnoticed by a particular detector.
Efficient detection of silent data corruption in HPC applications with synchronization-free message verification
- Computer Science
- J. Supercomput.
- 2022
This paper proposes the first SDC detection solution that requires neither synchronization nor additional message transmission between replicas. It combines message logging with an innovative asynchronous message comparison mechanism that uses specialized service routines to perform progress comparison without interfering with target program execution.
Evaluating the Viability of Using Compression to Mitigate Silent Corruption of Read-Mostly Application Data
- Computer Science
- 2017 IEEE International Conference on Cluster Computing (CLUSTER)
- 2017
This paper proposes an application-independent mechanism to efficiently detect and correct silent data corruption (SDC) in read-mostly memory, where SDC may be most likely to occur, and uses memory protection mechanisms to maintain compressed backups of application memory.
Hardening Strategies for HPC Applications
- Computer Science
- Anais Estendidos do Simpósio em Sistemas Computacionais de Alto Desempenho (WSCAD)
- 2019
This work presents and discusses radiation experiments that cover a total of more than 352,000 years of natural exposure and fault-injection analysis, and proposes and analyzes the impact of selective hardening for HPC algorithms.
Experimental and analytical study of Xeon Phi reliability
- Computer Science
- SC
- 2017
An in-depth analysis of transient fault effects on HPC applications in Intel Xeon Phi processors, based on radiation experiments and high-level fault injection, is presented, and it is shown that portions of applications can be graded by different criticalities.
Scaling and Resilience in Numerical Algorithms for Exascale Computing
- Computer Science
- 2018
A new adaptive scheduler is proposed that optimizes the parallel efficiency by minimizing the time-subdomain length without making communication of time-subdomains too costly, and it is demonstrated that it is possible to obtain time-parallel speedup on the nonlinear shallow water equation, beyond what is possible using conventional spatial domain-decomposition techniques alone.
Radiation-Induced Error Criticality in Modern HPC Parallel Accelerators
- Computer Science
- 2017 IEEE International Symposium on High Performance Computer Architecture (HPCA)
- 2017
It is shown that arithmetic operations are less critical for the K40, while Xeon Phi is more reliable when executing particle interactions solved through Finite Difference Methods, and iterative stencil operations seem the most reliable on both architectures.
User-level failure detection and auto-recovery of parallel programs in HPC systems
- Computer Science
- Frontiers Comput. Sci.
- 2021
This paper proposes an innovative method for the unprivileged user to achieve failure detection of job execution and automatic resubmission of failed jobs, and implements the proposed method as a tool named automatic re-launcher (ARL) and evaluates it on the Tianhe-2 system.
Detection of Silent Data Corruptions in Smoothed Particle Hydrodynamics Simulations
- Computer Science
- 2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)
- 2019
Selective particle replication (SPR) is proposed herein as the first particle-based replication method for detecting SDCs in Smoothed particle hydrodynamics (SPH) simulations, a mesh-free Lagrangian method commonly used to perform hydrodynamic simulations in astrophysics and computational fluid dynamics.
Resiliency in numerical algorithm design for extreme scale simulations
- Computer Science
- Dagstuhl Reports
- 2020
A broad range of perspectives are gathered on the role of algorithms, applications and systems in achieving resilience for extreme scale simulations to discuss novel ways to make applications resilient against detected and undetected faults.
References
SHOWING 1-10 OF 27 REFERENCES
Lightweight Silent Data Corruption Detection Based on Runtime Data Analysis for HPC Applications
- Computer Science
- HPDC
- 2015
A pointwise detection model with two phases: one involving the prediction of the next expected value in the time series for each data point, and another determining a range surrounding the predicted next-step value.
Detecting silent data corruption through data dynamic monitoring for scientific applications
- Computer Science
- PPoPP '14
- 2014
A novel technique to detect silent data corruption based on data monitoring is proposed and it is shown that this technique can detect up to 50% of injected errors while incurring only negligible overhead.
Detection and correction of silent data corruption for large-scale high-performance computing
- Computer Science
- 2012 International Conference for High Performance Computing, Networking, Storage and Analysis
- 2012
This paper studies the potential for redundancy to detect and correct soft errors in MPI message-passing applications while investigating the challenges inherent to detecting soft errors within MPI applications by providing transparent MPI redundancy.
Processor-Level Selective Replication
- Computer Science
- 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN'07)
- 2007
A processor-level technique called selective replication, by which the application can choose where in its application stream and to what degree it requires replication, is proposed. Results show that with about 59% less overhead than full duplication, selective replication detects 97% of the data errors and 87% of the instruction errors that were covered by full duplication.
Programmer-directed partial redundancy for resilient HPC
- Computer Science
- Conf. Computing Frontiers
- 2015
This work introduces a programmer-directed selective replication mechanism to provide fault tolerance while decreasing costs, and shows that this scheme detects and corrects around 65% of SDC errors with only 4% overhead.
Proactive process-level live migration in HPC environments
- Computer Science
- 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis
- 2008
A novel process-level live migration mechanism supports continued execution of applications during much of process migration and is integrated into an MPI execution environment to transparently sustain health-inflicted node failures, which eradicates the need to restart and requeue MPI jobs.
Detecting and Correcting Data Corruption in Stencil Applications through Multivariate Interpolation
- Computer Science
- 2015 IEEE International Conference on Cluster Computing
- 2015
This paper exploits multivariate interpolation in order to detect and correct data corruption in stencil applications and demonstrates that this mechanism can detect and correct most important corruptions and keep the error deviation under 1% during the entire execution while injecting one corruption per minute.
Opportunistic application-level fault detection through adaptive redundant multithreading
- Computer Science
- 2014 International Conference on High Performance Computing & Simulation (HPCS)
- 2014
This paper presents an application-level fault detection approach based on adaptive redundant multithreading, built from flexible building blocks for application-specific fault detection, which makes possible more reasonable performance overheads than full redundancy.
FTI: High performance Fault Tolerance Interface for hybrid systems
- Computer Science
- 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
- 2011
This work proposes a low-overhead high-frequency multi-level checkpoint technique in which a highly-reliable topology-aware Reed-Solomon encoding in a three-level checkpoint scheme is integrated in the Fault Tolerance Interface (FTI).
A Practical Approach for Handling Soft Errors in Iterative Applications
- Computer Science
- 2015 IEEE International Conference on Cluster Computing
- 2015
It is shown that changes in the value of the residue can serve as a signature that detects the soft errors with the most negative impact on the applications, and partial replication is proposed to improve accuracy without very large overheads.