Learn More
This paper examines methods of approximating the optimum checkpoint restart strategy for minimizing application run time on a system exhibiting Poisson single component failures. Two different models will be developed and compared. We will begin with a simplified cost function that yields a first-order model. Then we will derive a more complete cost(More)
As computational clusters rapidly grow in both size and complexity, system reliability and, in particular, application resilience have become increasingly important factors to consider in maintaining efficiency and providing improved computational performance over predecessor systems. One commonly used mechanism for providing application fault tolerance in(More)
As computational cluster computers rapidly grow in both size and complexity, system reliability and, in particular, application resilience have become increasingly important factors to consider in maintaining efficiency and providing improved compute performance over predecessor systems. One commonly used mechanism for providing application fault tolerance(More)
—While measures such as raw compute performance and system capacity continue to be important factors for evaluating cluster performance, such issues as system reliability and application resilience have become increasingly important as cluster sizes rapidly grow. Although efforts to directly improve fault-tolerance are important, it is also essential to(More)
  • John Daly, Bill Harrod, Doe Sc, Thuc Hoang, Doe Nnsa, Lucy Nowell +32 others
  • 2012
The following report summarizes the proceedings of a three-and-a-half day inter-agency workshop focused on the technical challenges of HPC resilience in the 2020 Exascale timeframe. The resilience problem is not specific to any particular program or agency; coordinated resilience solutions will be challenging because of the need for a truly integrated(More)
  • 1