Increasing Reliability through Dynamic Virtual Clustering

Wesley Emeneker, D. Stanzione, Ira A. Fulton
In a scientific community that increasingly relies on High Performance Computing (HPC) for large-scale simulations and analysis, the reliability of the hardware and applications devoted to HPC is extremely important. Because hardware reliability is unlikely to increase dramatically in the coming years, software must provide the reliability that demanding applications require. One way to increase the reliability of HPC systems is to use checkpointing to save the state of an application…

