Learn More
New algorithms and optimization techniques are needed to balance the accelerating trend towards bandwidth-starved multicore chips. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach that makes explicit use of shared caches in(More)
Task parallelism as employed by the OpenMP task construct or some Intel Threading Building Blocks (TBB) components, although ideal for tackling irregular problems or typical produc-er/consumer schemes, bears some potential for performance bottlenecks if locality of data access is important, which is typically the case for memory-bound code on ccNUMA(More)
We present a pipelined wavefront parallelization approach for stencil-based computations. Within a fixed spatial domain successive wavefronts are executed by threads scheduled to a multicore processor chip with a shared outer level cache. By re-using data from cache in the successive wavefronts this multicore-aware parallelization strategy employs temporal(More)
INTRODUCTION Many psychotropic drugs can delay cardiac repolarization and thereby prolong the rate-corrected QT interval (QTc). A prolonged QTc often arouses concern in clinical practice, as it can be followed, in rare cases, by the life-threatening polymorphic ventricular tachyarrhythmia called torsade de pointes (TdP). METHOD We searched PubMed for(More)
Algorithms with low computational intensity show interesting performance and power consumption behavior on multicore processors. We choose the lattice-Boltzmann method (LBM) as a prototype for this scenario in order to show if and how single-chip performance and power characteristics can be generalized to the highly parallel case. LBM is an algorithm for(More)
Bandwidth-starved multicore chips have become ubiquitous. It is well known that the performance of stencil codes can be improved by temporal blocking, lessening the pressure on the memory interface. We introduce a new pipelined approach that makes explicit use of shared caches in multicore environments and minimizes synchronization and boundary overhead.(More)
We present a simple library which equips MPI implementations with truly asynchronous non-blocking point-to-point operations, and which is independent of the underlying communication infrastructure. It utilizes the MPI profiling interface (PMPI) and the MPI_THREAD_MULTIPLE thread compatibility level, and works with current versions of Intel MPI, Open MPI,(More)
Today's High Performance Computing (HPC) clusters consist of hundreds of thousands of CPUs, memory units, complex networks, and other components. Such an extreme level of hardware parallelism reduces the mean time to failure (MTTF) of the overall cluster. The future of HPC urgently demands to develop environments that facilitate programs to run successfully(More)
The lattice-Boltzmann method (LBM) is an algorithm for CFD simulations that has gained popularity due to its ease of implementation and suitability for complex geometries. Its scalability on mul-ticore chips is often limited due to its low computational intensity , leading to interesting characteristics regarding optimal performance and energy to solution(More)