Proactive fault tolerance for HPC with Xen virtualization

Arun Babu Nagarajan, Frank Mueller, Christian Engelmann, Stephen L. Scott
Large-scale parallel computing increasingly relies on clusters with thousands of processors. At such large counts of compute nodes, faults are becoming commonplace. Current techniques to tolerate faults focus on reactive schemes to recover from faults and generally rely on a checkpoint/restart mechanism. Yet, in today's systems, node failures can often be anticipated by detecting a deteriorating health status. Instead of a reactive scheme for fault tolerance (FT), we are promoting a…



