Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra

@inproceedings{Wu2016TowardsPA,
  title={Towards Practical Algorithm Based Fault Tolerance in Dense Linear Algebra},
  author={Panruo Wu and Qiang Guan and Nathan DeBardeleben and Sean Blanchard and Dingwen Tao and Xin Liang and Jieyang Chen and Zizhong Chen},
  booktitle={HPDC '16},
  year={2016}
}
  • Panruo Wu, Qiang Guan, Nathan DeBardeleben, Sean Blanchard, Dingwen Tao, Xin Liang, Jieyang Chen, Zizhong Chen
  • Published in HPDC '16, 2016
  • Computer Science
  • Algorithm-based fault tolerance (ABFT) has attracted renewed interest for its extremely low overhead and good scalability. However, the fault model used to design ABFT has been either abstract, simplistic, or both, leaving a gap between what occurs at the architecture level and what the algorithm expects. Because the fault model is the deciding factor in choosing an effective checksum scheme, the resulting ABFT techniques have seen limited impact in practice. In this paper we seek to close the gap by…
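To make the term "checksum scheme" concrete, here is a minimal sketch of the classic row/column checksum encoding for matrix multiplication, from the early ABFT literature on matrix operations. It is not the scheme proposed in this paper. It assumes exactly the kind of simplistic fault model the abstract criticizes: at most one corrupted element, located in the data part of the product, with intact checksums. All function names are hypothetical and chosen for illustration only.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for row in a]


def add_column_checksum(a):
    """Append one extra row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Append to every row of b the sum of that row."""
    return [row + [sum(row)] for row in b]


def locate_error(cf, tol=1e-9):
    """Return (row, col) of a single corrupted data element, else None.

    cf is the full-checksum product: its last row and last column hold
    the column sums and row sums of the data part.
    """
    n, m = len(cf) - 1, len(cf[0]) - 1
    bad_rows = [i for i in range(n)
                if abs(sum(cf[i][:m]) - cf[i][m]) > tol]
    bad_cols = [j for j in range(m)
                if abs(sum(cf[i][j] for i in range(n)) - cf[n][j]) > tol]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        return bad_rows[0], bad_cols[0]
    return None


def correct_error(cf, pos):
    """Rebuild the element at pos from its row checksum (assumed intact)."""
    i, j = pos
    m = len(cf[0]) - 1
    cf[i][j] = cf[i][m] - sum(cf[i][k] for k in range(m) if k != j)


# Demo: encode, multiply, inject one fault, then locate and repair it.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
Cf = matmul(add_column_checksum(A), add_row_checksum(B))
Cf[0][1] += 5                      # simulated soft error in the data part
fault = locate_error(Cf)
if fault is not None:
    correct_error(Cf, fault)
```

The product of a column-checksum matrix and a row-checksum matrix is itself a full-checksum matrix, which is why one mismatching row and one mismatching column pinpoint the faulty element. A fault that hits a checksum entry, or several elements at once, falls outside what this sketch can handle.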

    Citations

    Publications citing this paper.
    SHOWING 1-10 OF 18 CITATIONS

    Fault Tolerant One-sided Matrix Decompositions on Heterogeneous Systems with GPUs

    CITES BACKGROUND & METHODS

    TSM2: Optimizing Tall-and-Skinny Matrix-Matrix Multiplication on GPUs

    • Anonymous
    • 2019
    CITES METHODS & BACKGROUND
    HIGHLY INFLUENCED

    Algorithm-Based Fault Tolerance for Convolutional Neural Networks

    CITES BACKGROUND & METHODS

    TSM2: optimizing tall-and-skinny matrix-matrix multiplication on GPUs

    CITES BACKGROUND

    Energy Analysis and Optimization for Resilient Scalable Linear Systems

    CITES BACKGROUND

    References

    Publications referenced by this paper.
    SHOWING 1-9 OF 9 REFERENCES

    An Analysis of Algorithm-Based Fault Tolerance Techniques

    HIGHLY INFLUENTIAL

    Algorithm-Based Fault Tolerance for Matrix Operations

    HIGHLY INFLUENTIAL

    Detection and correction of silent data corruption for large-scale high-performance computing

    • David Fiala
    • Computer Science
    • 2012 International Conference for High Performance Computing, Networking, Storage and Analysis
    • 2011
    HIGHLY INFLUENTIAL

    High Performance Dense Linear System Solver with Soft Error Resilience

    HIGHLY INFLUENTIAL

    Architecture Design for Soft Errors

    HIGHLY INFLUENTIAL

    QEMU, a Fast and Portable Dynamic Translator

    • Fabrice Bellard
    • Computer Science
    • USENIX Annual Technical Conference, FREENIX Track
    • 2005
    HIGHLY INFLUENTIAL