Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources

@article{Chen2006AlgorithmbasedCF,
  title={Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources},
  author={Zizhong Chen and Jack J. Dongarra},
  journal={Proceedings 20th IEEE International Parallel \& Distributed Processing Symposium},
  year={2006},
  pages={10 pp.-}
}
As the size of today's high-performance computers increases from hundreds to thousands, and even tens of thousands, of processors, node failures in these computers are becoming frequent events. Although checkpoint/rollback-recovery is the typical technique to tolerate such failures, it often introduces a considerable overhead. Algorithm-based fault tolerance is a very cost-effective method to incorporate fault tolerance into matrix computations. However, previous algorithm-based fault tolerance…
This paper has 86 citations.

Semantic Scholar estimates that this publication has 86 citations based on the available data.
