Fault prediction under the microscope: A closer look into HPC systems

Authors: Ana Gainaru, Franck Cappello, Marc Snir, William Kramer
Venue: 2012 International Conference for High Performance Computing, Networking, Storage and Analysis
A large percentage of computing capacity in today's large high-performance computing systems is wasted because of failures. Consequently, current research is focusing on providing fault tolerance strategies that aim to minimize the effects of faults on applications. By far the most popular technique is the checkpoint-restart strategy. A complement to this classical approach is failure avoidance, by which the occurrence of a fault is predicted and preventive measures are taken. This requires a reliable…
This paper has 109 citations.

Extracted Numerical Results

  • However, for the other 25% that propagate, a wrong prediction will lead to a decrease in both precision and recall.
  • Our previous work showed 43% recall and 93% precision for the LANL system by using a purely signal analysis approach.
  • When running our method without checking the location, we obtain a precision of around 94%.
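The excerpts above report prediction quality as precision and recall. As a minimal sketch of how these two metrics are conventionally computed from prediction counts, the Python snippet below uses a hypothetical helper `precision_recall` and purely illustrative counts; neither the function name nor the numbers come from the paper.

```python
def precision_recall(true_positives: int,
                     false_positives: int,
                     false_negatives: int) -> tuple[float, float]:
    """Return (precision, recall) for a set of failure predictions.

    precision = correct predictions / all predictions issued
    recall    = correct predictions / all failures that actually occurred
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    # Guard against division by zero when nothing was predicted or nothing failed.
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Illustrative counts only (assumed, not taken from the paper):
# 8 failures predicted correctly, 2 false alarms, 12 failures missed.
precision, recall = precision_recall(8, 2, 12)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Under these assumed counts, a predictor can be right most of the time when it does raise an alarm (high precision) while still missing more than half of the failures (low recall), which is the same kind of trade-off the reported figures describe.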


