Hauberk: Lightweight Silent Data Corruption Error Detector for GPGPU

@article{Yim2011HauberkLS,
  title={Hauberk: Lightweight Silent Data Corruption Error Detector for GPGPU},
  author={Keun Soo Yim and Cuong Manh Pham and Mushfiq Saleheen and Zbigniew T. Kalbarczyk and Ravishankar K. Iyer},
  journal={2011 IEEE International Parallel \& Distributed Processing Symposium},
  year={2011},
  pages={287--300}
}
The high performance and relatively low cost of GPU-based platforms make them an attractive alternative for general-purpose high-performance computing (HPC). However, emerging HPC applications usually have stricter output correctness requirements than typical GPU applications (i.e., 3D graphics). This paper first analyzes the error resiliency of GPGPU platforms using a fault injection tool we have developed for commodity GPU devices. On average, 16-33% of injected faults cause silent data…

Citations

Publications citing this paper.
SHOWING 1-10 OF 57 CITATIONS

Characterization of Impact of Transient Faults and Detection of Data Corruption Errors in Large-Scale N-Body Programs Using Graphics Processing Units

  • 2014 IEEE 28th International Parallel and Distributed Processing Symposium
  • 2014
CITES METHODS & BACKGROUND
HIGHLY INFLUENCED

Pluggable Watchdog: Transparent Failure Detection for MPI Programs

  • 2013 IEEE 27th International Symposium on Parallel and Distributed Processing
  • 2013
CITES BACKGROUND
HIGHLY INFLUENCED

Exploring Soft-Error Robust and Energy-Efficient Register File in GPGPUs using Resistive Memory

  • ACM Trans. Design Autom. Electr. Syst.
  • 2016
CITES BACKGROUND & METHODS
HIGHLY INFLUENCED

Soft-error reliability and power co-optimization for GPGPUs register file using resistive memory

  • 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE)
  • 2015
CITES BACKGROUND
HIGHLY INFLUENCED

VOCL-FT: introducing techniques for efficient soft error coprocessor recovery

  • SC15: International Conference for High Performance Computing, Networking, Storage and Analysis
  • 2015
CITES METHODS
HIGHLY INFLUENCED

End-to-End Resilience for HPC Applications

CITES BACKGROUND

Fault Site Pruning for Practical Reliability Analysis of GPGPU Applications

  • 2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
  • 2018
CITES METHODS

CITATION STATISTICS

  • 7 Highly Influenced Citations

References

Publications referenced by this paper.
SHOWING 1-10 OF 24 REFERENCES

Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU

  • 2010 10th IEEE/ACM International Conference on Cluster, Cloud and Grid Computing
  • 2009
HIGHLY INFLUENTIAL

A high-performance fault-tolerant software framework for memory on commodity GPUs

  • 2010 IEEE International Symposium on Parallel & Distributed Processing (IPDPS)
  • 2010

Measurement-based analysis of fault and error sensitivities of dynamic memory

  • 2010 IEEE/IFIP International Conference on Dependable Systems & Networks (DSN)
  • 2010

Programming Massively Parallel Processors. A Hands-on Approach

  • Scalable Computing: Practice and Experience
  • 2010

CheCUDA: A Checkpoint/Restart Tool for CUDA Applications

  • 2009 International Conference on Parallel and Distributed Computing, Applications and Technologies
  • 2009

Haque and V. S. Pande, "Hard Data on Soft Errors: A Large-Scale Assessment of Real-World Error Rates in GPGPU"

Y.-K. Kim, B. Chung
  • Proceedings of the IEEE/ACM International Conference on Cluster, Cloud and Grid Computing; NVIDIA's Next Generation CUDA Compute Architecture: Fermi, White Paper v1.1
  • 2009