Application Resilience: Making Progress in Spite of Failure

@inproceedings{Jones2008ApplicationRM,
  title={Application Resilience: Making Progress in Spite of Failure},
  author={William M. Jones and John T. Daly and Nathan DeBardeleben},
  booktitle={2008 Eighth IEEE International Symposium on Cluster Computing and the Grid (CCGRID)},
  year={2008},
  pages={789--794}
}
While measures such as raw compute performance and system capacity continue to be important factors for evaluating cluster performance, such issues as system reliability and application resilience have become increasingly important as cluster sizes rapidly grow. Although efforts to directly improve fault-tolerance are important, it is also essential to accept that application failures will inevitably occur and to ensure that progress is made despite these failures. Application monitoring…
