Swen Böhm

We present a monitoring system for large-scale parallel and distributed computing environments that allows accuracy to be traded off in a tunable fashion to gain scalability without compromising fidelity. The approach relies on classifying each gathered monitoring metric based on individual needs and on aggregating messages containing classes of individual…
An emerging aspect of high-performance computing (HPC) hardware/software co-design is investigating performance under failure. The work in this paper extends the Extreme-scale Simulator (xSim), which was designed for evaluating the performance of message passing interface (MPI) applications on future HPC architectures, with fault-tolerant MPI extensions…
The high-performance computing (HPC) community is working to address fault tolerance and resilience concerns for current and future large-scale computing platforms. This is driving enhancements in the programming environments, specifically research on enhancing message passing libraries to support fault-tolerant computing capabilities. The community has…
The design and implementation of new run-time systems for high-performance computing (HPC) is usually tailored to the implementation of a programming language or a tool. Unfortunately, this leads to obvious problems since the design and implementation of new run-time systems is expensive and complex, resulting in effort duplication for all HPC run-time…
The high-performance computing (HPC) community continues to increase the size and complexity of hardware platforms that support advanced scientific workloads. The run-time environment (RTE) is a crucial layer in the software stack for these large-scale systems. The RTE manages the interface between the operating system and the application running in…