This paper introduces Netgauge, an extensible open-source framework for implementing network benchmarks. The structure of Netgauge abstracts and explicitly separates communication patterns from communication modules. As a result of this separation of concerns, new benchmark types and new network protocols can be added independently to Netgauge. We describe…
Designing a 2048-core high-performance cluster, including an appropriate parallel storage complex and a high-speed network, under budget (2.6 million euros), performance, thermal, and space constraints is a challenging task. In this paper, we present our design decisions and their reasons, our experiences during the installation…
The MPI_Barrier() call can be crucial for several applications and has been the target of various optimizations for several decades. The best solution to the barrier problem scales with O(log₂ N) and uses the dissemination principle. A new method using an enhanced dissemination principle and inherent network parallelism will be demonstrated in this paper…
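As an illustration of the dissemination principle mentioned in the abstract above, the following is a minimal sketch of the classic dissemination barrier schedule, not the enhanced method the paper proposes. It assumes the usual formulation in which, in round r, rank i notifies rank (i + 2^r) mod N; the function names are hypothetical and chosen only for this example.

```python
def dissemination_rounds(n):
    """Rounds needed for n processes: ceil(log2(n)), computed with integers."""
    return (n - 1).bit_length()


def dissemination_schedule(n):
    """One list of (sender, receiver) pairs per round.

    Assumption: in round r, rank i notifies rank (i + 2**r) % n.
    """
    return [
        [(i, (i + 2 ** r) % n) for i in range(n)]
        for r in range(dissemination_rounds(n))
    ]


def simulate(n):
    """Return True if, after all rounds, every rank has heard from every rank.

    known[i] is the set of ranks whose arrival rank i has learned about,
    either directly or transitively through earlier notifications.
    """
    known = [{i} for i in range(n)]
    for pairs in dissemination_schedule(n):
        # All messages of a round are sent concurrently, so each sender
        # forwards what it knew at the start of the round.
        snapshot = [set(s) for s in known]
        for src, dst in pairs:
            known[dst] |= snapshot[src]
    return all(len(s) == n for s in known)


if __name__ == "__main__":
    for n in (2, 5, 8, 13):
        print(n, dissemination_rounds(n), simulate(n))
```

The simulation only checks the logical property that makes the schedule a barrier (every rank has transitively heard from every other rank after ⌈log₂ N⌉ rounds); it does not model network latency or parallelism.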
Accurate models of parallel computation are often crucial for optimizing the running time of parallel algorithms. In general, the easier a model is to use and the fewer parameters and interdependencies it has, the more inaccuracies are introduced by simplification. On the other hand, an overly complex model is unusable. We show that it is…
Large-scale parallel applications performing global synchronization may spend a significant amount of execution time waiting for the completion of a barrier operation. Consequently, numerous studies have focused on reducing the communication costs of synchronization primitives. However, so far there has been no exhaustive comparison of barrier…
Several different algorithms are available for synchronizing multiple processors. Some of them support only shared memory architectures or very fine-grained supercomputers. This work gives an overview of all currently known algorithms that are suitable for distributed shared memory architectures and message-passing-based computer…
The performance of the barrier operation can be crucial for many parallel codes. Distributed shared memory systems in particular have to synchronize frequently to ensure the proper ordering of memory accesses. The barrier operation is often performed on top of point-to-point messages, and the best algorithm scales with O(log₂ P · L) in the LogP model. We…
To leverage high-speed interconnects like InfiniBand, it is important to minimize the communication overhead. The most disruptive overhead is the registration of communication memory. In this paper, we present our analysis of the memory registration process inside the Mellanox InfiniBand driver and possible ways around this bottleneck. We evaluate and…
We present a microbenchmark suite to evaluate InfiniBand™ implementations with regard to single-message performance and the addressing of many hosts. We use a 1:n communication pattern to assess the latency and bandwidth for all combinations of InfiniBand™ transport services and functions. The results gathered in this study are used to…
This paper describes the basic concepts of our solution for improving the performance of Ethernet communication in a Linux cluster environment by introducing Reliable Low Latency Ethernet Sockets. We show that about 25% of the socket latency can be saved by using our simplified protocol. In particular, we emphasize demonstrating that this performance…