Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing

@article{Chen2009HighlySS,
  title={Highly Scalable Self-Healing Algorithms for High Performance Scientific Computing},
  author={Zizhong Chen and Jack J. Dongarra},
  journal={IEEE Transactions on Computers},
  year={2009},
  volume={58},
  pages={1512--1524}
}
As the number of processors in today's high-performance computers continues to grow, the mean-time-to-failure of these computers is becoming significantly shorter than the execution time of many current high-performance computing applications. Although today's architectures are usually robust enough to survive node failures without suffering complete system failure, most of today's high-performance computing applications cannot survive node failures. Therefore, whenever a node fails, all…
This paper has 34 citations.