This paper examines methods of approximating the optimum checkpoint-restart strategy for minimizing application run time on a system exhibiting Poisson single-component failures. Two different models will be developed and compared. We will begin with a simplified cost function that yields a first-order model. Then we will derive a more complete cost(More)
As computational clusters rapidly grow in both size and complexity, system reliability and, in particular, application resilience have become increasingly important factors to consider in maintaining efficiency and providing improved computational performance over predecessor systems. One commonly used mechanism for providing application fault tolerance in(More)
While measures such as raw compute performance and system capacity continue to be important factors for evaluating cluster performance, such issues as system reliability and application resilience have become increasingly important as cluster sizes rapidly grow. Although efforts to directly improve fault-tolerance are important, it is also essential to(More)
As computational clusters rapidly grow in both size and complexity, system reliability and, in particular, application resilience have become increasingly important factors to consider in maintaining efficiency and providing improved compute performance over predecessor systems. One commonly used mechanism for providing application fault tolerance(More)
The massive scale of current and next-generation massively parallel processing (MPP) systems presents significant challenges related to fault tolerance. For applications that perform periodic checkpoints, the choice of the checkpoint interval, the period between checkpoints, can have a significant impact on the execution time of the application. Finding the(More)
The large scale of current and next-generation massively parallel processing (MPP) systems presents significant challenges related to fault tolerance. For applications that perform periodic checkpointing, the choice of the checkpoint interval, the period between checkpoints, can have a significant impact on the execution time of the application and the(More)
Unfortunately our world does not always reward us when we expect it, and we must learn to deal with nonreward. How do these experiences influence our behaviors and how can we use them to help us? In Frustration Theory: An Analysis of Dispositional Learning and Memory (1992), Abram Amsel has answered these questions; he has summarized over 40 years of(More)