Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU

@inproceedings{Haque2010HardDO,
  title={Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU},
  author={Imran S. Haque and Vijay S. Pande},
  booktitle={2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing},
  year={2010},
  pages={691--696}
}
  • I. Haque, V. Pande
  • Published 2 October 2009
  • Computer Science
  • 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
Graphics processing units (GPUs) are gaining widespread use in high-performance computing because of their performance advantages relative to CPUs. However, the reliability of GPUs is largely unproven. In particular, current GPUs lack error checking and correcting (ECC) in their memory subsystems. The impact of this design has not been previously measured at a large enough scale to quantify soft error events. We present MemtestG80, our software for assessing memory error rates on NVIDIA… 

Evaluation and Mitigation of Radiation-Induced Soft Errors in Graphics Processing Units
TLDR
Novel insights on GPU reliability are given by evaluating the neutron sensitivity of modern GPU memory structures, highlighting pattern dependence and multiple-error occurrences; error-correcting code, algorithm-based fault tolerance, and comparison hardening strategies are presented and evaluated on GPUs through radiation experiments.
CPU-GPU hybrid bidiagonal reduction with soft error resilience
TLDR
This paper presents a design of a bidiagonal reduction algorithm that is resilient to soft errors, and its implementation on hybrid CPU-GPU architectures is described, using Algorithm Based Fault Tolerance combined with reverse computation.
GPUburn: A system to test and mitigate GPU hardware failures
  • D. Defour, Eric Petit
  • Computer Science
  • 2013 International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS)
  • 2013
TLDR
A new methodology to characterize the hardware failures of Nvidia GPUs, based on a software micro-benchmarking platform implemented in OpenCL, and a methodology to detect and record the location of defective units so that they can be avoided, ensuring program correctness on such architectures and improving GPU fault-tolerance capability and lifespan.
Matrix Multiplication on GPUs with On-Line Fault Tolerance
TLDR
The main contribution of the paper is to extend the traditional algorithm-based fault tolerance (ABFT) from offline to online and apply it to matrix multiplication on GPUs.
Multi-faceted microarchitecture level reliability characterization for NVIDIA and AMD GPUs
TLDR
An extended study based on a consolidated workflow for the evaluation of the reliability in correlation with the performance of four GPU architectures and corresponding chips: AMD Southern Islands and NVIDIA G80/GT200/Fermi.
Understanding GPU errors on large-scale HPC systems and the implications for system design and operation
TLDR
A detailed study is presented to provide a thorough understanding of GPU errors on a large-scale GPU-enabled system, and results from extensive neutron-beam tests are presented to measure the resilience of different generations of GPUs.
A large-scale study of soft-errors on GPUs in the field
TLDR
This study characterizes and quantifies different kinds of soft-errors on the Titan supercomputer's GPU nodes, and uncovers several interesting and previously unknown insights about the characteristics and impact of soft-errors.
Towards Building Error Resilient GPGPU Applications
TLDR
A fault injection study to investigate the end-to-end reliability characteristics of GPGPU applications shows that heuristics are able to reduce the SDC-causing faults by 60% on average, while incurring reasonable performance overheads (35% to 95%).
Real-world design and evaluation of compiler-managed GPU redundant multithreading
TLDR
This paper presents a real-world design and evaluation of automatic software RMT on GPU hardware, and demonstrates the benefit of architectural support for RMT with a specific example of fast, register-level thread communication.
...

References

SHOWING 1-10 OF 53 REFERENCES
A hardware redundancy and recovery mechanism for reliable scientific computation on graphics processors
TLDR
A hardware redundancy-based approach to reliability for general purpose computation on GPUs that requires minimal change to existing GPU architectures and is completely transparent to general graphics and does not affect the performance of the games that drive the market.
DRAM errors in the wild: a large-scale field study
TLDR
Measurements of memory errors in a large fleet of commodity servers over a period of 2.5 years provide strong evidence that memory errors are dominated by hard errors, rather than soft errors, which previous work suspects to be the dominant error mode.
On testing GPU memory for hard and soft errors
TLDR
This short paper reports on an attempt to test GPU memory for both permanent memory errors due to manufacturing defects and prolonged use and soft errors due to single radiation events.
Software-Based ECC for GPUs
TLDR
This work adds small pieces of code to normal CUDA programs that compute ECCs for data residing in graphics memory so that transient bit-flips can be detected or masked, and discusses how the performance overheads derive from the cost of ECC computation on GPUs.
The visual vulnerability spectrum: characterizing architectural vulnerability for graphics hardware
TLDR
The Visual Vulnerability Spectrum (VVS) is introduced to analyze the effect of increased transient error rate on graphics processors, and several targeted, inexpensive solutions are suggested that can mitigate the most egregious of soft error consequences.
Soft errors in electronic memory-a white paper
TLDR
Tests and standards have been developed to measure and improve the resistance of memory chips to alpha particles – but soft errors have not disappeared.
PAPER—Accelerating parallel evaluations of ROCS
TLDR
The design and implementation of PAPER, an open‐source implementation of Gaussian molecular shape overlay for NVIDIA GPUs is described and one to two order‐of‐magnitude speedups on high‐end commodity GPU hardware relative to a reference CPU implementation of the shape overlay algorithm are demonstrated.
Accelerating molecular modeling applications with graphics processors
TLDR
An overview of recent advances in programmable GPUs is presented, with an emphasis on their application to molecular mechanics simulations and the programming techniques required to obtain optimal performance in these cases.
Fast support vector machine training and classification on graphics processors
TLDR
A solver for Support Vector Machine training run on a GPU, using the Sequential Minimal Optimization algorithm and an adaptive first and second order working set selection heuristic, which achieves speedups of 9-35x over LIBSVM running on a traditional processor.
Tutorial on semiconductor memory testing
TLDR
The structure and operation of the main types of semiconductor memory are described, and the different contexts in which memories are tested together with the corresponding different types of tests are described.
...