Collective communication operations, used by many scientific applications, tend to limit overall parallel application performance and scalability. Computer systems are becoming increasingly heterogeneous, with growing node and core-per-node counts. In addition, a growing number of data-access mechanisms, of varying characteristics, are supported within a single computer …
This paper explores the computation/communication overlap enabled by the new CORE-Direct hardware capabilities introduced in the ConnectX-2 InfiniBand Network Interface Card (NIC). We use the latency-dominated nonblocking barrier algorithm in this study and find that, at a 64-process count, a contiguous time slot of about 80% of the …
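For reference, the overlap pattern studied here can be expressed at the application level with the standard MPI-3 nonblocking barrier. The sketch below is illustrative only: `do_independent_work` is a hypothetical placeholder, and the code does not reproduce the paper's CORE-Direct offload path.

```c
#include <mpi.h>

/* Hypothetical placeholder for application work that does not
   depend on the barrier having completed. */
static void do_independent_work(void) { /* ... */ }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    MPI_Request req;
    int done = 0;

    /* Start the nonblocking barrier; with NIC-level offload the
       network can progress it while the CPU keeps computing. */
    MPI_Ibarrier(MPI_COMM_WORLD, &req);

    /* Overlap: keep computing until the barrier has completed. */
    while (!done) {
        do_independent_work();
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}
```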
This paper introduces the newly developed InfiniBand (IB) Management Queue capability, used by the Host Channel Adapter (HCA) to manage network-task data-flow dependencies and to progress the communications associated with such flows. These tasks include sends, receives, and the newly supported wait task, and are scheduled by the HCA based on a data …
High-performance computing (HPC) has begun scaling beyond the petaflop range toward the exaflop (1,000 petaflops) mark. One of the major concerns throughout the development of such performance capability is scalability, both at the system level and at the application layer. In this paper we present a novel approach for a new design concept: co-design …
This paper describes the design and implementation of InfiniBand (IB) CORE-Direct based blocking and nonblocking broadcast operations within the Cheetah collective operations framework. It describes a novel approach that fully offloads collective operations and employs only user-supplied buffers. For a 64-rank communicator, the latency of …
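For context, the usage pattern of a nonblocking broadcast over a user-supplied buffer can be sketched with the standard MPI-3 `MPI_Ibcast`. This is a minimal, generic illustration, not the Cheetah/CORE-Direct implementation described in the paper; the buffer size is arbitrary.

```c
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* User-supplied buffer: no library-internal staging buffers. */
    const int count = 1 << 20;
    double *buf = malloc(count * sizeof(double));
    if (rank == 0)
        for (int i = 0; i < count; i++) buf[i] = (double)i;

    /* Nonblocking broadcast from rank 0 to all ranks. */
    MPI_Request req;
    MPI_Ibcast(buf, count, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req);

    /* ... independent computation could be overlapped here ... */

    MPI_Wait(&req, MPI_STATUS_IGNORE);

    free(buf);
    MPI_Finalize();
    return 0;
}
```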
OpenSHMEM is an effort to standardize the well-known SHMEM parallel programming library. The project aims to produce an open-source, portable SHMEM API and is led by ORNL and UH. In this paper, we optimize the current OpenSHMEM reference implementation, based on GASNet, to achieve higher performance. To achieve these desired performance …
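A minimal OpenSHMEM program, using only standard API calls (`shmem_init`, `shmem_malloc`, `shmem_long_put`, `shmem_barrier_all`), illustrates the symmetric-heap put/synchronization pattern whose performance such optimizations target. It is a generic sketch, not tied to the GASNet-based reference implementation.

```c
#include <shmem.h>
#include <stdio.h>

int main(void)
{
    shmem_init();

    int me   = shmem_my_pe();
    int npes = shmem_n_pes();

    /* Symmetric allocation: the same address is valid on every PE. */
    long *dest = shmem_malloc(sizeof(long));
    long  src  = (long)me;

    /* Each PE writes its rank into its right neighbor's symmetric memory. */
    shmem_long_put(dest, &src, 1, (me + 1) % npes);

    /* Barrier completes all outstanding puts before any PE proceeds. */
    shmem_barrier_all();
    printf("PE %d received %ld\n", me, *dest);

    shmem_free(dest);
    shmem_finalize();
    return 0;
}
```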
To improve the scalability of InfiniBand on large-scale clusters, Open MPI introduced a protocol known as B-SRQ [2]. This protocol was shown to provide much better memory utilization of send and receive buffers for a wide variety of benchmarks and real-world applications. Unfortunately, B-SRQ increases the number of connections between communicating peers. …
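B-SRQ builds on the InfiniBand shared receive queue (SRQ) mechanism, which lets many connections draw receive buffers from a single pool instead of posting per-connection receives. The libibverbs sketch below shows only SRQ creation; the queue sizes are chosen arbitrarily and error handling is abbreviated.

```c
#include <infiniband/verbs.h>
#include <stdio.h>

int main(void)
{
    /* Open the first available InfiniBand device. */
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no IB device\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* A shared receive queue: many QPs can draw receive buffers from it,
       so per-connection receive buffering is no longer required. */
    struct ibv_srq_init_attr attr = {
        .attr = { .max_wr = 1024, .max_sge = 1 }
    };
    struct ibv_srq *srq = ibv_create_srq(pd, &attr);
    if (!srq) { fprintf(stderr, "ibv_create_srq failed\n"); return 1; }

    /* ... QPs would then be created with .srq = srq in ibv_create_qp() ... */

    ibv_destroy_srq(srq);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}
```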
Clusters featuring the InfiniBand interconnect are continuing to scale. As an example, the "Ranger" system at the Texas Advanced Computing Center (TACC) includes over 60,000 cores with nearly 4,000 InfiniBand ports. The latest Top500 list shows that 30% of systems, and over 50% of the top 100, are now using InfiniBand as the compute-node …