An emerging aspect of high-performance computing (HPC) hardware/software co-design is investigating performance under failure. The work in this paper extends the Extreme-scale Simulator (xSim), which was designed for evaluating the performance of Message Passing Interface (MPI) applications on future HPC architectures, with fault-tolerant MPI extensions…
We present a monitoring system for large-scale parallel and distributed computing environments that allows accuracy to be traded off in a tunable fashion to gain scalability without compromising fidelity. The approach relies on classifying each gathered monitoring metric based on individual needs and on aggregating messages containing classes of individual…
The high-performance computing (HPC) community is working to address fault tolerance and resilience concerns for current and future large-scale computing platforms. This is driving enhancements in programming environments, specifically research on enhancing message passing libraries to support fault-tolerant computing capabilities. The community has…
The design and implementation of new run-time systems for high-performance computing (HPC) is usually tailored to the implementation of a programming language or a tool. Unfortunately, this leads to obvious problems since the design and implementation of new run-time systems is expensive and complex, resulting in effort duplication for all HPC run-time…
A team of researchers from ORNL, NICS, and JICS/UTK is developing a code, the General Astrophysical Simulation System (GenASiS), for multi-scale, multi-physics simulations of core-collapse supernovae on leadership computing facility architectures. In their paper, GenASiS: General Astrophysical Simulation System. I. Refinable Mesh And Nonrelativistic…
The high-performance computing (HPC) community continues to increase the size and complexity of hardware platforms that support advanced scientific workloads. The run-time environment (RTE) is a crucial layer in the software stack for these large-scale systems. The RTE manages the interface between the operating system and the application running in…
Acknowledgments: I would like to use this space to express my gratitude to all the people who helped me make this happen. First, there are my parents; without their help I would never have come so far. I would also like to thank all my colleagues at the Computer and Mathematics Division at the Oak Ridge National Laboratory, especially Dr. Christian…