As high-end computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, like checkpoint-restart, are increasingly problematic at these scales due to excessive overheads predicted to more than double an application's time to solution.
The challenge of balancing between power and performance is now well established. While research in this area is well underway, the ability to measure power and energy in situ has remained an obstacle. This problem is magnified in the field of High Performance Computing (HPC). To meet this challenge, a device called PowerInsight has been designed to ...
Power has recently been recognized as one of the major obstacles in fielding a Peta-FLOPs class system. To reach Exa-FLOPs, the challenge will certainly be compounded. In this paper we will discuss a number of High Performance Computing power-related topics. We first describe our implementation of a scalable power measurement framework that has enabled us ...
This paper will summarize an IO performance analysis effort performed on Sandia National Laboratories' Red Storm platform. Our goal was to examine the IO system performance and identify problems or bottlenecks in any aspect of the IO subsystem. Our process examined the entire IO path from application to disk both in segments and as a whole. Our final ...
As part counts in high performance computing systems are projected to increase faster than part reliabilities, there is increasing interest in enabling jobs to continue to execute in the presence of failures. Process replication has been shown to be a viable method to accomplish this, but previous studies have focussed on full replication levels (dual, ...
Historically, scientific computing applications have been statically linked before running on massively parallel High Performance Computing (HPC) platforms. In recent years, demand for supporting dynamically linked applications at large scale has increased. When programs running at large scale dynamically load shared objects, they often request the same ...