Investigating parallel application performance at scale is an important part of high-performance computing (HPC) application development. The Extreme-scale Simulator (xSim) is a performance toolkit that permits running an application in a controlled environment at extreme scale without the need for a respective extreme-scale HPC system. Using a lightweight ...
We present a monitoring system for large-scale parallel and distributed computing environments that allows accuracy to be traded off in a tunable fashion to gain scalability without compromising fidelity. The approach relies on classifying each gathered monitoring metric based on individual needs and on aggregating messages containing classes of individual ...
An emerging aspect of high-performance computing (HPC) hardware/software co-design is investigating performance under failure. The work in this paper extends the Extreme-scale Simulator (xSim), which was designed for evaluating the performance of message passing interface (MPI) applications on future HPC architectures, with fault-tolerant MPI extensions ...
The high-performance computing (HPC) community is working to address fault tolerance and resilience concerns for current and future large-scale computing platforms. This is driving enhancements in the programming environments, specifically research on enhancing message passing libraries to support fault-tolerant computing capabilities. The community has ...
As multi-petascale and exa-scale high-performance computing (HPC) systems inevitably have to deal with a number of resilience challenges, such as a significant growth in component count and smaller circuit sizes with lower circuit voltages, redundancy may offer an acceptable level of resilience that traditional fault tolerance techniques, such as ...
The design and implementation of new run-time systems for high-performance computing (HPC) is usually tailored to the implementation of a programming language or a tool. Unfortunately, this leads to obvious problems since the design and implementation of new run-time systems is expensive and complex, resulting in effort duplication for all HPC run-time ...
A team of researchers from ORNL, NICS, and JICS/UTK is developing a code, the General Astrophysical Simulation System (GenASiS), for multi-scale, multi-physics simulations of core-collapse supernovae on leadership computing facility architectures. In their paper, GenASiS: General Astrophysical Simulation System. I. Refinable Mesh And Nonrelativistic ...
The high-performance computing (HPC) community continues to increase the size and complexity of hardware platforms that support advanced scientific workloads. The run-time environment (RTE) is a crucial layer in the software stack for these large-scale systems. The RTE manages the interface between the operating system and the application running in ...
Exascale-targeted scientific applications must be prepared for a highly concurrent computing environment where failure will be a regular event during execution. Natural and algorithm-based fault tolerance (ABFT) techniques can often manage failures more efficiently than traditional checkpoint/restart techniques alone. Central to many petascale ...
Acknowledgments: I would like to use this space to express my gratitude to all the people who helped me make all of this happen. First, there are my parents; without their help I would never have come so far. I also want to thank all my colleagues at Computer and Mathematics Division at the Oak Ridge National Laboratory, especially Dr. Christian ...