This paper describes the evolution of the Portals message passing architecture and programming interface from its initial development on tightly-coupled massively parallel platforms to the current implementation running on a 1792-node commodity PC Linux cluster. Portals provides the basic building blocks needed for higher-level protocols to implement(More)
This paper describes a portable benchmark suite that assesses the ability of cluster networking hardware and software to overlap MPI communication and computation. The Communication Offload MPI-based Benchmark, or COMB, uses two different methods to characterize the ability of messages to make progress concurrently with computational processing on the host(More)
In this paper we demonstrate that the placement of functionality can have a significant impact on the performance of applications. OS bypass distributes OS policies to the network interface and protocol processing to the application to enhance application performance. We take this notion one step further and consider the distribution of application(More)
This paper describes how a portable benchmark suite that measures the ability of an MPI implementation to overlap computation and communication can be used to discover and diagnose performance problems. We describe the approach of the benchmark suite and discuss a performance problem that we uncovered with the MPI implementation on the ASCI/Red(More)
Latency and bandwidth are usually considered to be the dominant factors in parallel application performance; however, recent studies have indicated that support for independent progress in MPI can also have a significant impact on application performance. This paper leverages the Cplant system at Sandia National Labs to compare a faster, vendor-provided MPI(More)