Understanding the propagation of hard errors to software and implications for resilient system design

@inproceedings{Li2008UnderstandingTP,
  title={Understanding the propagation of hard errors to software and implications for resilient system design},
  author={Man-Lap Li and Pradeep Ramachandran and S. Sahoo and S. Adve and V. Adve and Yuanyuan Zhou},
  booktitle={ASPLOS 2008},
  year={2008}
}
With continued CMOS scaling, future shipped hardware will be increasingly vulnerable to in-the-field faults. To be broadly deployable, the hardware reliability solution must incur low overheads, precluding use of expensive redundancy. We explore a cooperative hardware-software solution that watches for anomalous software behavior to indicate the presence of hardware faults. Fundamental to such a solution is a characterization of how hardware faults in different microarchitectural structures of a…
SWAT: An Error Resilient System
Characterizing the Impact of Intermittent Hardware Faults on Programs
CrashTest'ing SWAT: Accurate, gate-level evaluation of symptom-based resiliency solutions
Characterizing and exploiting application behavior under data corruption
Relyzer: Application Resiliency Analyzer for Transient Faults
Trace-based microarchitecture-level diagnosis of permanent hardware faults
A HW-dependent software model for cross-layer fault analysis in embedded systems
Exploring the Synergy of Emerging Workloads and Silicon Reliability Trends

References

ReStore: Symptom-Based Soft Error Detection in Microprocessors
Perturbation-based Fault Screening
Dynamic Derivation of Application-Specific Error Detectors and their Implementation in Hardware