The Role of Idle Waves, Desynchronization, and Bottleneck Evasion in the Performance of Parallel Programs

Ayesha Afzal, Georg Hager, Gerhard Wellein
The performance of highly parallel applications on distributed-memory systems is influenced by many factors. Analytic performance modeling techniques aim to provide insight into performance limitations and are often the starting point of optimization efforts. However, coupling analytic models across the system hierarchy (socket, node, network) fails to encompass the intricate interplay between the program code and the hardware, especially when execution and communication bottlenecks are…
Exploring Techniques for the Analysis of Spontaneous Asynchronicity in MPI-Parallel Applications
It is shown how desynchronization patterns can be readily identified from a data set that is much smaller than a full MPI trace, leading the way towards a more general classification of parallel program dynamics.


Analytic Modeling of Idle Waves in Parallel Programs: Communication, Cluster Topology, and Noise Impact
A validated analytic model for the propagation mechanisms of idle waves across the ranks of MPI-parallel programs and an analytic expression for the idle wave decay rate with respect to noise power are derived.
Delay Flow Mechanisms on Clusters
Synthetic microbenchmarks are used to highlight three effects that are of importance in this context: propagation of long-term delays, noise-assisted decay of propagating delays, and noise-induced desynchronization of memory-bound applications.
Propagation and Decay of Injected One-Off Delays on Clusters: A Case Study
This work investigates traces collected from synthetic benchmarks that mimic real applications on simulated and real message-passing systems in order to pinpoint the mechanisms behind delay propagation. It studies how the propagation speed of "idle waves," i.e., propagating phases of inactivity emanating from injected delays, depends on the execution and communication properties of the application.
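The basic propagation mechanism can be illustrated with a toy simulation (this is a simplified sketch, not the model from any of the papers above): ranks on an open chain perform identical work per time step and synchronize with both nearest neighbors via blocking exchange, so a one-off delay injected on one rank spreads outward by one rank per step, forming an "idle wave."

```python
# Toy idle-wave simulation (illustrative sketch, not the papers' model).
# P ranks on an open chain do `work` seconds of computation per step and
# block on both nearest neighbors before starting the next step. A one-off
# delay injected on one rank propagates outward one rank per step.

def simulate(P=16, steps=20, work=1.0, delay_rank=8, delay_step=2, delay=5.0):
    finish = [0.0] * P          # wall-clock time each rank finishes the current step
    history = []
    for t in range(steps):
        new = []
        for i in range(P):
            # A rank can start its next step only after it and both of its
            # neighbors have finished the previous one (blocking exchange).
            start = max(finish[max(i - 1, 0)], finish[i], finish[min(i + 1, P - 1)])
            w = work + (delay if (i == delay_rank and t == delay_step) else 0.0)
            new.append(start + w)
        finish = new
        history.append(list(finish))
    return history

hist = simulate()
# Ranks at distance d from the injection point become delayed d steps later:
# the idle wave travels at one rank per step until it hits the chain boundary.
```

Because the exchange is bidirectional, the wave front moves symmetrically in both directions; with one-sided dependencies it would travel in only one direction, which is one of the topology effects the analytic models above capture.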
Identifying the Root Causes of Wait States in Large-Scale Parallel Applications
By replaying event traces in parallel both in forward and backward direction, this work can identify the processes and call paths responsible for the most severe imbalances even for runs with tens of thousands of processes.
Analytic performance model for parallel overlapping memory‐bound kernels
A performance model for the execution of memory‐bound loop kernels is constructed that predicts the memory bandwidth share per kernel on a memory contention domain, depending on the number of active cores and on which other workload the kernel is paired with.
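A minimal saturation-style share calculation conveys the idea (this is an assumed, simplified scheme for illustration, not the model from the paper): each kernel demands some per-core bandwidth, the contention domain caps the achievable total, and the achieved bandwidth is split in proportion to demand.

```python
# Illustrative bandwidth-sharing sketch (simplified; not the paper's model).
# Each co-running kernel demands n_cores * per_core_bw of memory bandwidth;
# the contention domain saturates at b_sat GB/s, and the achieved total is
# distributed among kernels in proportion to their demand.

def bandwidth_share(demands, b_sat):
    """demands: list of (kernel_name, n_cores, per_core_bw_GBs) tuples."""
    total_demand = sum(n * b for _, n, b in demands)
    scale = min(1.0, b_sat / total_demand) if total_demand > 0 else 0.0
    return {name: n * b * scale for name, n, b in demands}

# Two kernels paired on a hypothetical 40 GB/s contention domain: the
# combined demand of 56 GB/s exceeds the limit, so both are scaled down.
share = bandwidth_share([("triad", 4, 8.0), ("copy", 4, 6.0)], b_sat=40.0)
```

In this sketch a kernel's share shrinks as more cores activate or as a more bandwidth-hungry partner is paired with it, which is the qualitative behavior the validated model quantifies.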
Desynchronization and Speedup in an Asynchronous Conservative Parallel Update Protocol
This chapter gives an overview of how the methods of non-equilibrium surface growth (physics of complex systems) can be applied to uncover properties of state update algorithms used in distributed parallel discrete-event simulations (PDES), and shows that conservative PDES are generally scalable in the ring communication topology.
System noise, OS clock ticks, and fine-grained parallel applications
This work identifies a major source of noise to be indirect overhead of periodic OS clock interrupts ("ticks"), that are used by all general-purpose OSs as a means of maintaining control, and suggests replacing ticks with an alternative mechanism the authors call "smart timers".
Idle waves in high-performance computing.
This study describes the large number of processes in parallel scientific applications as a continuous medium and identifies the propagation of idle waves through processes that exchange information locally with their nearest neighbors.
The Influence of Operating Systems on the Performance of Collective Operations at Extreme Scale
It is demonstrated that synchronizing the noise can significantly reduce its negative influence, and on extreme-scale platforms, the performance is correlated with the largest interruption to the application, even if the probability of such an interruption is extremely small.
Introduction to High Performance Computing for Scientists and Engineers
  • G. Hager, G. Wellein
  • Computer Science
    Chapman and Hall/CRC computational science series
  • 2011
The authors show how to avoid or ameliorate typical performance problems connected with OpenMP, present cache-coherent nonuniform memory access (ccNUMA) optimization techniques, examine distributed-memory parallel programming with the Message Passing Interface (MPI), and explain how to write efficient MPI code.