High-Performance and Scalable MPI over InfiniBand with Reduced Memory Usage: An In-Depth Performance Analysis

Sayantan Sur, Matthew J. Koop, Dhabaleswar K. Panda
ACM/IEEE SC 2006 Conference (SC'06)
InfiniBand is an emerging HPC interconnect being deployed in very large scale clusters, with even larger InfiniBand-based clusters expected to be deployed in the near future. The Message Passing Interface (MPI) is the programming model of choice for scientific applications running on these large scale clusters. It is therefore critical that the MPI implementation in use be based on a scalable and high-performance design. We analyze the performance and scalability aspects of MVAPICH, a popular…

High performance MPI design using unreliable datagram for ultra-scale InfiniBand clusters

This is the first research work that presents a high-performance MPI design over InfiniBand that is completely based on UD and can achieve near identical or better application performance than RC.

Scalable and high-performance MPI design for very large InfiniBand clusters

This dissertation presents novel designs based on the new features offered by InfiniBand, in order to design scalable and high-performance MPI libraries for large-scale clusters with tens-of-thousands of nodes.

High-Performance Multi-Transport MPI Design for Ultra-Scale InfiniBand Clusters

This dissertation explores the different transports provided by InfiniBand to determine the scalability and performance aspects of each, and proposes and implements new MPI designs for transports that have never been used for MPI in the past.

Adaptive Receiver Window Scaling: Minimizing MPI Communication Memory over InfiniBand

This paper presents a mechanism that dynamically adapts resource consumption according to application runtime characteristics, enabling the MPI layer to consume only minimum amounts of resources, and provides analysis of the proposed design in combination with the effect of low-level InfiniBand flow-control timers on end-application memory usage.

Scalable Collective Communication for Next-Generation Multicore Clusters with InfiniBand

This paper studies the utility of shared memory for improving performance and reducing resource consumption, along with the impact of cutting down network transactions for important collective operations such as MPI_Barrier and MPI_Allreduce.

Scalable High Performance Message Passing over InfiniBand for Open MPI

This paper uses the software reliability capabilities of Open MPI to provide the guaranteed delivery semantics required by MPI, and shows that UD not only requires fewer resources at scale, but also allows for shorter MPI startup times.

Memory Footprint of Locality Information on Many-Core Platforms

  • Brice Goglin
  • Computer Science
    2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
  • 2018
This article's analysis of the physical and virtual memories in supercomputing architectures shows that this shared region can be mapped at the same virtual address in all processes, hence dramatically simplifying the software implementation of MPI.

SR-IOV Support for Virtualization on InfiniBand Clusters: Early Experience

Experimental evaluations show that the performance of MPI and PGAS point-to-point communication benchmarks over SR-IOV with InfiniBand is comparable to that of the native InfiniBand hardware for most message lengths, but that the performance of MPI collective operations over SR-IOV with InfiniBand is inferior to native (non-virtualized) mode.

MPC-MPI: An MPI Implementation Reducing the Overall Memory Consumption

This article proposes MPC-MPI, an MPI implementation built upon the MPC framework that reduces the overall memory footprint, obtaining up to 47% memory gain on benchmarks and a real-world application.

ZedWulf-A Zynq SoC cluster for Energy-Efficient Acceleration of Graph Problems

  • Computer Science
  • 2015
This work builds a cluster composed of 32 Zynq-based devices (ZedWulf) to accelerate sparse graph problems, which are typically memory-bottlenecked on traditional x86 systems, and forms a performance model with a coefficient of determination of more than 90% for understanding runtime scaling trends for the authors' novel scatter-gather routine.



Shared receive queue based scalable MPI design for InfiniBand clusters

This paper proposes a novel MPI design which efficiently utilizes SRQs and provides very good performance, and reveals that the proposed designs take only 1/10th the memory requirement as compared to the original design on a cluster sized at 16,000 nodes.

Infiniband scalability in Open MPI

Open MPI, a new open source implementation of the MPI standard targeted for production computing, provides several mechanisms to enhance Infiniband scalability, and initial comparisons with MVAPICH show similar performance but with much better scalability characteristics.

Performance Comparison of MPI Implementations over InfiniBand, Myrinet and Quadrics

The results show that the three MPI implementations all have their advantages and disadvantages, and InfiniBand can offer significant performance improvements for a number of applications compared with Myrinet and Quadrics when using the PCI-X bus.

High Performance RDMA-Based MPI Implementation over InfiniBand

A new design of MPI over InfiniBand is proposed which brings the benefit of RDMA to not only large messages, but also small and control messages and achieves better scalability by exploiting application communication pattern and combining send/receive operations with RDMA operations.

Adaptive connection management for scalable MPI over InfiniBand

  • Weikuan Yu, Qi Gao, D. Panda
  • Computer Science
    Proceedings 20th IEEE International Parallel & Distributed Processing Symposium
  • 2006
This paper proposes adaptive connection management (ACM) to dynamically control the establishment of InfiniBand reliable connections (RC) based on the communication frequency between MPI processes and experimental results indicate that ACM algorithms can benefit parallel programs in terms of the process initiation time, the number of active connections, and the resource usage.

Open MPI: Goals, Concept, and Design of a Next Generation MPI Implementation

Open MPI provides a unique combination of novel features previously unavailable in an open-source, production-quality implementation of MPI, which provides both a stable platform for third-party research as well as enabling the run-time composition of independent software add-ons.

RDMA read based rendezvous protocol for MPI over InfiniBand: design alternatives and benefits

This paper proposes several mechanisms to exploit RDMA Read and selective interrupt based asynchronous progress to provide better computation/communication overlap on InfiniBand clusters and indicates that the designs have a strong positive impact on scalability of parallel applications.

Analyzing Ultra-Scale Application Communication Requirements for a Reconfigurable Hybrid Interconnect

Overall results show that HFAST is a promising approach for practically addressing the interconnect requirements of future peta-scale systems.

The NAS Parallel Benchmarks

A new set of benchmarks has been developed for the performance evaluation of highly parallel supercomputers that mimic the computation and data movement characteristics of large-scale computational fluid dynamics applications.