As supercomputers are entering an era of massive parallelism where the frequency of faults is increasing, the MPI Standard remains distressingly vague on the consequence of failures on MPI…
In a constant effort to deliver steady performance improvements, the size of High Performance Computing (HPC) systems, as observed by the Top 500 ranking, has grown tremendously over the last decade. …
As high-performance computing (HPC) systems increase in size and complexity, they become more difficult to manage. The enormous component counts associated with these large systems lead to significant…
Most predictions of Exascale machines picture billion-way parallelism, encompassing not only millions of cores, but also tens of thousands of nodes. Even considering extremely optimistic advances in…
The Open Source Cluster Application Resources (OSCAR) toolkit is used to build and maintain HPC clusters. The OSCAR cluster installer provides a graphical user interface (GUI) "wizard" that directs…
User-level failure mitigation (ULFM) is becoming the front-running solution for process fault tolerance in MPI. While not yet adopted into the MPI standard, it is being used by applications and…
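A minimal sketch of how an application might use ULFM, assuming the extension calls exposed by Open MPI through <mpi-ext.h> (MPIX_Comm_revoke, MPIX_Comm_shrink); the recovery logic is illustrative, not the strategy of any particular application mentioned above:

```c
#include <mpi.h>
#include <mpi-ext.h>   /* ULFM extensions: MPIX_Comm_revoke, MPIX_Comm_shrink */
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Comm comm = MPI_COMM_WORLD;
    int rank, rc;

    MPI_Init(&argc, &argv);
    /* Report process failures as error codes instead of aborting the job. */
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
    MPI_Comm_rank(comm, &rank);

    rc = MPI_Barrier(comm);
    if (rc != MPI_SUCCESS) {
        MPI_Comm shrunk;
        /* Propagate knowledge of the failure to all surviving ranks ... */
        MPIX_Comm_revoke(comm);
        /* ... then derive a new communicator containing only the survivors. */
        MPIX_Comm_shrink(comm, &shrunk);
        comm = shrunk;
        /* The application would now redistribute work and resume on 'comm'. */
    }

    printf("rank %d continuing\n", rank);
    MPI_Finalize();
    return 0;
}
```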
MPI includes all processes in MPI_COMM_WORLD; this is untenable for reasons of scale, resiliency, and overhead. This paper offers a new approach, extending MPI with a new concept called Sessions,…
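The Sessions concept proposed here was later incorporated into MPI-4. A minimal sketch using the MPI-4 interface (MPI_Session_init, MPI_Group_from_session_pset, MPI_Comm_create_from_group), where the process-set name "mpi://WORLD" is a built-in set and the tag string is arbitrary:

```c
#include <mpi.h>
#include <stdio.h>

/* Obtain a communicator from a process set instead of relying on the
 * global MPI_COMM_WORLD. */
int main(void)
{
    MPI_Session session;
    MPI_Group group;
    MPI_Comm comm;
    int rank;

    MPI_Session_init(MPI_INFO_NULL, MPI_ERRORS_RETURN, &session);

    /* Applications may also query runtime-provided process sets and use
     * only the subset they actually need. */
    MPI_Group_from_session_pset(session, "mpi://WORLD", &group);
    MPI_Comm_create_from_group(group, "example.tag", MPI_INFO_NULL,
                               MPI_ERRORS_RETURN, &comm);

    MPI_Comm_rank(comm, &rank);
    printf("rank %d using a session-derived communicator\n", rank);

    MPI_Comm_free(&comm);
    MPI_Group_free(&group);
    MPI_Session_finalize(&session);
    return 0;
}
```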
This paper provides an in-depth analysis of the software overheads in the MPI performance-critical path and exposes mandatory performance overheads that are unavoidable based on the MPI-3.1…
Distributed and parallel systems are typically managed with "static" settings: the operating system (OS) and the runtime environment (RTE) are specified at a given time and cannot be changed to fit…