Silent Data Corruption - Myth or reality?

@inproceedings{Constantinescu2008SilentDC,
  title={Silent Data Corruption - Myth or reality?},
  author={Cristian Constantinescu and Ishwar Parulkar and R. Harper and Sarah Ellen Michalak},
  booktitle={DSN},
  year={2008}
}
The higher complexity of the hardware and software employed by modern computing systems, as well as semiconductor technology scaling, are increasing the likelihood of silent data corruption (SDC). SDC occurs when incorrect data is provided to the user, e.g., written to the memory or I/O system, and no error is triggered. Such events may have catastrophic effects in the case of life-critical applications, and represent a significant cost penalty for businesses. The purpose of this panel is to…

Mimic: Fast Recovery from Data Corruption Errors in Stencil Computations

  • Anis Alazzawe, K. Kant
  • Computer Science
    2019 IEEE 38th International Performance Computing and Communications Conference (IPCCC)
  • 2019
TLDR
This paper presents a computational model, referred to as mimic replication, that provides resilience against SDC errors by dynamically re-executing processes whose data may have been tainted by a detected latent error. It also provides an analytical model for trading off resource and energy consumption against resilience.

Bi-Source Verification Against Silent Data Corruption in High Performance Computing

TLDR
This work compares two approaches to overcoming SDC, presenting the advantages and shortcomings of each, and shows that of the two proposed methods, threshold-triggered and continuous verification, the latter is superior in terms of latency.

Unprotected Computing: A Large-Scale Study of DRAM Raw Error Rate on a Supercomputer

TLDR
The study showed that most multi-bit errors corrupted non-adjacent bits in the memory word and that most errors flipped memory bits from 1 to 0, and proposed several directions in which the findings can help the design of more reliable systems in the future.

Safe limits on voltage reduction efficiency in GPUs: A direct measurement approach

TLDR
The accurate Vmin prediction opens up new possibilities of a cross-layer dynamic guardbanding scheme for GPUs, in which software predicts and manages the voltage guardband, while the functional correctness is ensured by a hardware safety net mechanism.

Resilience of an embedded architecture using hardware redundancy

TLDR
A new system element called a syndrome is proposed as the core of a resilient architecture whose software and hardware can adapt to reliable and unreliable environments, providing a more efficient combination of reliability, performance, and power consumption than existing techniques.

Scaling and Resilience in Numerical Algorithms for Exascale Computing

TLDR
A new adaptive scheduler is proposed that optimizes the parallel efficiency by minimizing the time-subdomain length without making communication of time-subdomains too costly, and it is demonstrated that it is possible to obtain time-parallel speedup on the nonlinear shallow water equation, beyond what is possible using conventional spatial domain-decomposition techniques alone.

Predictive Guardbanding: Program-Driven Timing Margin Reduction for GPUs

TLDR
This article explores the energy benefits of reducing the GPU chip’s voltage to the safe limit, i.e., $V_{\min}$, and shows how to use kernels’ microarchitectural performance counters to predict its $V_{\min}$ value accurately.

Scalable Algorithmic Detection of Silent Data Corruption for High-Dimensional PDEs

TLDR
This paper shows how to benefit from the numerical properties of a well-established extrapolation method—the combination technique—to make it tolerant to silent data corruption (SDC), and shows that the method has a very good detection rate.

Exploiting GPU Undervoltage to Improve the Energy Efficiency of Deep Learning Applications

TLDR
This work proposes an approach to study the potential energy savings of reducing the supply voltage of General Purpose Graphics Processing Units, using an AMD Radeon Vega Frontier Edition GPGPU, and applies it to current deep learning models to provide insight into their behavior under minimum supply voltage.

Impact of Radiation on Electronics

Exposure to radiation of electronic devices can lead to catastrophic system failures in embedded systems, significantly affecting their reliability. Therefore, prior to the design of a resilient