Kurt B. Ferreira

As high-end computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, like checkpoint-restart, are increasingly problematic at these scales due to excessive overheads predicted to more than double an application's time to solution.
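A minimal sketch of the standard first-order checkpoint/restart cost model (Young's approximation) that underlies overhead claims like the one above; the parameter values and variable names are illustrative assumptions, not figures from the paper.

```cpp
// First-order checkpoint/restart cost model (Young's approximation).
// All parameter values below are illustrative assumptions.
#include <cmath>
#include <cstdio>

int main() {
    const double solve_hours      = 100.0;  // failure-free work
    const double checkpoint_hours = 0.25;   // cost of writing one checkpoint (delta)
    const double mtbf_hours       = 8.0;    // system mean time between failures (M)

    // Young's approximation for the optimal checkpoint interval: tau = sqrt(2 * delta * M).
    const double tau = std::sqrt(2.0 * checkpoint_hours * mtbf_hours);

    // Rough overhead: time spent writing checkpoints plus expected lost work
    // of about half an interval per failure.
    const double checkpoint_overhead = checkpoint_hours / tau;
    const double rework_overhead     = (tau / 2.0) / mtbf_hours;
    const double time_to_solution    = solve_hours * (1.0 + checkpoint_overhead + rework_overhead);

    std::printf("optimal checkpoint interval: %.2f h\n", tau);
    std::printf("estimated time to solution: %.1f h (vs. %.1f h failure-free)\n",
                time_to_solution, solve_hours);
    return 0;
}
```

Shrinking the MTBF or growing the checkpoint cost in this simple model quickly pushes the overhead toward and past 100% of the failure-free run time, which is the scaling concern the abstract raises.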
Faults have become the norm rather than the exception for high-end computing on clusters with tens to hundreds of thousands of cores. Exacerbating this situation, some of these faults will not be detected, manifesting themselves as silent errors that corrupt memory while applications continue to operate and report incorrect results. This paper introduces …
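As a generic illustration of the silent-error problem described above (not the specific mechanism this paper introduces, which the excerpt truncates), a checksum maintained over critical application state can expose corruption before the data are consumed:

```cpp
// Minimal illustration of software-level silent-error detection: keep a checksum
// over critical application state and verify it before the data are used.
// This is a generic sketch, not the mechanism introduced in the paper.
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Simple XOR-fold checksum over a buffer of doubles (illustrative only;
// a real detector would likely use a stronger code such as a CRC).
static std::uint64_t checksum(const std::vector<double>& data) {
    std::uint64_t sum = 0;
    for (double v : data) {
        std::uint64_t bits;
        std::memcpy(&bits, &v, sizeof(bits));
        sum ^= bits;
    }
    return sum;
}

int main() {
    std::vector<double> state(1024, 1.0);
    const std::uint64_t expected = checksum(state);

    // Simulate a silent bit flip in memory.
    std::uint64_t bits;
    std::memcpy(&bits, &state[137], sizeof(bits));
    bits ^= (1ULL << 23);
    std::memcpy(&state[137], &bits, sizeof(bits));

    if (checksum(state) != expected) {
        std::printf("silent corruption detected: recompute or roll back\n");
    } else {
        std::printf("state verified\n");
    }
    return 0;
}
```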
Operating system noise has been shown to be a key limiter of application scalability in high-end systems. While several studies have attempted to quantify the sources and effects of system interference using user-level mechanisms, there are few published studies on the effect of different kinds of kernel-generated noise on application performance at scale.
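For context, a minimal user-level probe in the spirit of the fixed-work-quantum (FWQ) benchmarks such studies use: time the same short compute loop repeatedly and attribute unusually slow samples to interference. The iteration counts and the 5% threshold below are arbitrary assumptions, and this sketch is the user-level approach the abstract contrasts with, not the kernel-level instrumentation the study itself describes.

```cpp
// User-level "fixed work quantum" style noise probe: run identical work many
// times and report how often a sample runs noticeably slower than the best case.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <ratio>
#include <vector>

volatile double sink = 0.0;  // keep the work loop from being optimized away

static void fixed_work(int iters) {
    double acc = 0.0;
    for (int i = 0; i < iters; ++i) acc += i * 1e-9;
    sink = acc;
}

int main() {
    const int samples = 10000;
    const int work_iters = 100000;
    std::vector<double> usec(samples);

    for (int s = 0; s < samples; ++s) {
        auto t0 = std::chrono::steady_clock::now();
        fixed_work(work_iters);
        auto t1 = std::chrono::steady_clock::now();
        usec[s] = std::chrono::duration<double, std::micro>(t1 - t0).count();
    }

    const double best  = *std::min_element(usec.begin(), usec.end());
    const double worst = *std::max_element(usec.begin(), usec.end());
    const long noisy = std::count_if(usec.begin(), usec.end(),
                                     [&](double t) { return t > 1.05 * best; });

    std::printf("min %.1f us, max %.1f us, %ld/%d samples >5%% above min (noise)\n",
                best, worst, noisy, samples);
    return 0;
}
```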
Several recent publications have shown that hardware faults in the memory subsystem are commonplace. These faults are predicted to become more frequent in future systems that contain orders of magnitude more DRAM and SRAM than found in current memory subsystems. These memory subsystems will need to provide resilience techniques to tolerate these faults when …
Today's largest High Performance Computing (HPC) systems exceed one petaflop (10^15 floating-point operations per second), and exascale systems are projected within seven years. Reliability is becoming one of the major challenges faced by exascale computing: with billion-core parallelism, the mean time to failure is projected to be in the range of …
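The scaling arithmetic behind such projections is straightforward if node failures are assumed independent and exponentially distributed: the system-level MTBF is roughly the per-node MTBF divided by the node count. The per-node figure in the sketch below is an assumed value, not one taken from the paper.

```cpp
// Back-of-the-envelope system MTBF under independent, exponential node failures.
// The per-node MTBF is an illustrative assumption.
#include <cstdio>

int main() {
    const double node_mtbf_years = 5.0;  // assumed per-node MTBF
    const double node_mtbf_hours = node_mtbf_years * 365.0 * 24.0;

    const double node_counts[] = {1e4, 1e5, 1e6};
    for (double nodes : node_counts) {
        const double system_mtbf_hours = node_mtbf_hours / nodes;
        std::printf("%8.0f nodes -> system MTBF ~ %6.2f hours\n",
                    nodes, system_mtbf_hours);
    }
    return 0;
}
```

With these assumptions, a million-node machine sees a failure every few minutes even though each individual node fails only once in several years, which is why the mean time to failure shrinks so sharply at exascale.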
… interface. The Failable interface has methods for marking, unmarking, and checking whether the object's data are allowed to experience bit flips. FT-GMRES marks the failability of the relevant objects on entry to the inner solver and unmarks them on exit. We describe below how we implemented this high-level interface using the low-level application/OS …
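A hedged C++ sketch of what a Failable-style interface could look like, based only on the behavior the excerpt describes (mark, unmark, and query failability, with marking on entry to the inner solver and unmarking on exit); the method names, the FailableVector class, and the RAII guard are illustrative assumptions, not the paper's actual declarations.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Abstract interface for objects whose data may be allowed to experience bit flips.
class Failable {
public:
    virtual ~Failable() = default;
    virtual void markFailable() = 0;      // data may experience bit flips
    virtual void unmarkFailable() = 0;    // restore full protection
    virtual bool isFailable() const = 0;  // query current protection state
};

// Illustrative implementation: here failability is just a flag; the paper maps
// this high-level interface onto a low-level application/OS mechanism.
class FailableVector : public Failable {
public:
    explicit FailableVector(std::size_t n) : data_(n, 0.0) {}
    void markFailable() override   { failable_ = true;  }
    void unmarkFailable() override { failable_ = false; }
    bool isFailable() const override { return failable_; }
private:
    std::vector<double> data_;
    bool failable_ = false;
};

// RAII guard mirroring the described usage: mark the relevant objects on entry
// to the inner solver and unmark them on exit.
class FailableScope {
public:
    explicit FailableScope(Failable& obj) : obj_(obj) { obj_.markFailable(); }
    ~FailableScope() { obj_.unmarkFailable(); }
private:
    Failable& obj_;
};

int main() {
    FailableVector x(1 << 20);
    {
        FailableScope guard(x);  // inner solve: bit flips tolerated in x
        std::printf("inner solve: x failable? %d\n", x.isFailable());
    }
    std::printf("after inner solve: x failable? %d\n", x.isFailable());
    return 0;
}
```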
In the early 1990s, researchers at Sandia National Laboratories and the University of New Mexico began development of customized system software for massively parallel “capability” computing platforms. These lightweight kernels have proven to be essential for delivering the full power of the underlying hardware to applications. This claim is underscored by …
Proposed exascale systems will present a number of considerable resiliency challenges. In particular, DRAM soft errors, or bit flips, are expected to increase greatly due to the higher memory density of these systems. Current hardware-based fault-tolerance methods will be unsuitable for addressing the expected soft-error rate. As a result, …
Current fault tolerance protocols are not sufficiently scalable for the exascale era. The most widely used method, coordinated checkpointing, places enormous demands on the I/O subsystem and imposes frequent synchronizations. Uncoordinated protocols use message logging, which introduces message-rate limitations or undesired memory and storage requirements to …
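For illustration, a minimal sketch of sender-based message logging, the kind of mechanism uncoordinated protocols rely on; copying every outgoing payload into a local log is what creates the memory and storage overhead noted above. The class and field names are assumptions for illustration only, not the protocol proposed in the paper.

```cpp
// Sender-based message logging sketch: every outgoing message is copied into a
// local log so it can be replayed if the receiver fails and rolls back.
#include <cstdint>
#include <cstdio>
#include <vector>

struct LoggedMessage {
    int dest_rank;              // receiver of the original send
    std::uint64_t sequence;     // sequence number to preserve replay order
    std::vector<char> payload;  // copy of the message body (the memory cost)
};

class SenderLog {
public:
    // Record a message as it is sent; the stored copy is the overhead.
    void record(int dest, const std::vector<char>& body) {
        log_.push_back({dest, next_seq_++, body});
    }

    // On receiver restart, replay everything previously sent to that rank.
    void replay_to(int dest) const {
        for (const auto& m : log_) {
            if (m.dest_rank == dest) {
                std::printf("replaying seq %llu to rank %d (%zu bytes)\n",
                            (unsigned long long)m.sequence, m.dest_rank,
                            m.payload.size());
            }
        }
    }

private:
    std::uint64_t next_seq_ = 0;
    std::vector<LoggedMessage> log_;
};

int main() {
    SenderLog log;
    log.record(1, std::vector<char>(64, 'x'));
    log.record(2, std::vector<char>(128, 'y'));
    log.replay_to(1);  // pretend rank 1 failed and restarted
    return 0;
}
```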
Concern is beginning to grow in the high-performance computing (HPC) community regarding the reliability guarantees of future large-scale systems. Disk-based coordinated checkpoint/restart has been the dominant fault tolerance mechanism in HPC systems for the last 30 years. Checkpoint performance is so fundamental to scalability that nearly all capability …