Tuning and Optimization for a Variety of Many-Core Architectures Without Changing a Single Line of Implementation Code Using the Alpaka Library

Alexander Matthes, René Widera, Erik Zenker, Benjamin Worpitz, Axel Huebl, Michael Bussmann
We present an analysis of optimizing the performance of a single C++11 source code using the Alpaka hardware abstraction library. We use the general matrix multiplication (GEMM) algorithm to show that compilers can optimize Alpaka code effectively when key parameters of the algorithm are tuned. We do not intend to rival existing, highly optimized DGEMM implementations, but merely choose this example to demonstrate that Alpaka allows for platform-specific tuning with a single source code. In…
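The tuning the abstract refers to can be pictured with a plain C++ sketch (our own, not Alpaka's API): a tiled GEMM whose tile size is a compile-time parameter, the kind of knob that can be set differently per target architecture without touching the kernel body. The names `Tile` and `gemm_tiled` are illustrative only.

```cpp
#include <cassert>
#include <cstddef>

// Sketch of a tunable tiled matrix multiply: `Tile` is a compile-time
// tuning parameter; blocking the loops improves cache reuse, and the
// best Tile value differs per architecture. C must be zero-initialized.
template <std::size_t Tile, std::size_t N>
void gemm_tiled(const double (&A)[N][N], const double (&B)[N][N],
                double (&C)[N][N]) {
    for (std::size_t i0 = 0; i0 < N; i0 += Tile)
        for (std::size_t k0 = 0; k0 < N; k0 += Tile)
            for (std::size_t j0 = 0; j0 < N; j0 += Tile)
                // Multiply one Tile x Tile block of A and B into C.
                for (std::size_t i = i0; i < i0 + Tile && i < N; ++i)
                    for (std::size_t k = k0; k < k0 + Tile && k < N; ++k)
                        for (std::size_t j = j0; j < j0 + Tile && j < N; ++j)
                            C[i][j] += A[i][k] * B[k][j];
}
```

In Alpaka itself the analogous parameters (block and element counts per level of its parallelism hierarchy) are chosen at the call site, so the kernel source stays identical across backends.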
Challenges Porting a C++ Template-Metaprogramming Abstraction Layer to Directive-based Offloading
This work presents its approach of porting the GPU-accelerated particle-in-cell code PIConGPU to OpenACC and OpenMP target by adding two new backends to its existing C++-template metaprogramming-based offloading abstraction layer alpaka and avoiding other modifications to the application code.
Portability: A Necessary Approach for Future Scientific Software
According to this white paper, it would be easier if researchers could develop scientific software once and then execute it on many different hardware combinations without having to rewrite the code over and over again.
High Performance Implementation of Boris Particle Pusher on DPC++. A First Look at oneAPI
This paper shows how to adapt the C++ implementation of the particle push algorithm from the Hi-Chi project to the DPC++ programming language and reports on the performance of the code on high-end Intel CPUs and Intel GPUs.
HETSIM: Simulating Large-Scale Heterogeneous Systems using a Trace-driven, Synchronization and Dependency-Aware Framework
HETSIM is implemented, a trace-driven, synchronization and dependency-aware framework for fast and accurate pre-silicon performance and power estimations for heterogeneous systems with up to thousands of cores, and demonstrated through design-space exploration on two recent target architectures.
Metrics and Design of an Instruction Roofline Model for AMD GPUs
This article designs an instruction roofline model for AMD GPUs using AMD's ROCProfiler and a benchmarking tool, BabelStream, as a way to measure an application’s performance in instructions and memory transactions on new AMD hardware.
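The roofline idea behind that article can be stated in one formula: attainable throughput is capped either by peak compute or by memory bandwidth times arithmetic (or instruction) intensity. The following minimal helper is our own formulation, not ROCProfiler output.

```cpp
#include <algorithm>
#include <cassert>

// Roofline model sketch: performance in GFLOP/s is bounded by the lower
// of the compute roof (peak_gflops) and the memory roof, i.e. bandwidth
// (GB/s) times arithmetic intensity (FLOPs per byte moved).
double roofline_gflops(double peak_gflops, double bw_gbs,
                       double intensity_flop_per_byte) {
    return std::min(peak_gflops, bw_gbs * intensity_flop_per_byte);
}
```

Low-intensity kernels land on the sloped memory roof, high-intensity kernels on the flat compute roof; an instruction roofline simply swaps FLOPs for instructions in the intensity term.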
Transmuter: Bridging the Efficiency Gap using Memory and Dataflow Reconfiguration
A flexible accelerator called Transmuter is presented, in a novel effort to bridge the gap between General-Purpose Processors (GPPs) and Application-Specific Integrated Circuits (ASICs), which addresses a rapidly growing set of algorithms exhibiting dynamic data movement patterns, irregularity, and sparsity, while delivering GPU-like efficiencies for traditional dense applications.
Compiler-Level Matrix Multiplication Optimization for Deep Learning
Two novel algorithms for GEMM optimization based on the TVM framework are proposed: a lightweight Greedy Best-First Search (G-BFS) method based on heuristic search, and a Neighborhood Actor Advantage Critic (N-A2C) method based on reinforcement learning, both of which show significant performance improvements.
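The greedy flavor of such a search can be sketched in a few lines. This is a deliberately simplified hill-climb over one parameter (a tile size), with a `measure` callback standing in for a real timing run; the actual G-BFS method searches a far richer schedule space.

```cpp
#include <cassert>
#include <functional>
#include <initializer_list>

// Greedy search sketch: from a starting tile size, repeatedly move to
// the neighboring candidate (half or double) with the lowest measured
// cost, stopping at a local optimum. `measure` is a stand-in for an
// actual benchmark of the generated kernel.
int greedy_tile_search(int start, int lo, int hi,
                       const std::function<double(int)>& measure) {
    int best = start;
    double bestCost = measure(best);
    for (;;) {
        int next = best;
        for (int cand : {best / 2, best * 2}) {
            if (cand < lo || cand > hi) continue;
            double c = measure(cand);
            if (c < bestCost) { bestCost = c; next = cand; }
        }
        if (next == best) return best;  // no neighbor improves: done
        best = next;
    }
}
```

With a unimodal cost over powers of two, the climb converges to the minimizing tile size in a logarithmic number of measurements.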
Evaluation of performance portability frameworks for the implementation of a particle‐in‐cell code
An in‐depth evaluation of the performance portability frameworks Kokkos and RAJA with respect to their suitability for the implementation of complex particle‐in‐cell (PIC) simulation codes concludes that the Kokkos framework would be suited best to tackle the massively parallel implementation of the full PIC model.
DASH: Distributed Data Structures and Parallel Algorithms in a Global Address Space
Recent developments in the context of DASH concerning the ability to execute tasks with remote dependencies, the exploitation of dynamic hardware locality, smart data structures, and advanced algorithms are described.
Quantum ESPRESSO toward the exascale.
A motivation and a brief review are presented of the ongoing effort to port Quantum ESPRESSO onto heterogeneous architectures based on hardware accelerators, which will overcome the energy constraints currently hindering the path toward exascale computing.


Alpaka -- An Abstraction Library for Parallel Kernel Acceleration
The Alpaka library defines and implements an abstract hierarchical redundant parallelism model that achieves platform and performance portability across various types of accelerators by ignoring unsupported levels of the hierarchy and utilizing only those supported on a specific accelerator.
Performance-Portable Many-Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond
This work demonstrates how the CUDA-based open-source plasma simulation code PIConGPU can benefit from the tunable kernel execution strategies of the Alpaka library, achieving portability and performance with single-source kernels on conventional CPUs, Power8 CPUs and NVIDIA GPUs.
Kokkos: Enabling Performance Portability Across Manycore Architectures
The Kokkos C++ library is developed to provide scientific and engineering codes with a user-accessible, many-core, performance-portable programming model, and enables users' code to satisfy multiple architecture-specific memory-access-pattern performance constraints without having to modify their source code.
An optimized large-scale hybrid DGEMM design for CPUs and ATI GPUs
This paper investigates advanced software-pipelining optimizations for the double-precision general matrix multiplication (DGEMM) algorithm running on a heterogeneous system that includes ATI GPUs and results show that resource contention on the PCIe bus and on the host memory are limiting factors.
OpenMP: an industry standard API for shared-memory programming
At its most elemental level, OpenMP is a set of compiler directives and callable runtime library routines that extend Fortran (and, separately, C and C++) to express shared-memory parallelism.
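The directive-based design described in this entry means the same source compiles both serially and in parallel. A minimal illustration (our own example, not from the cited paper): the `#pragma` is simply ignored by a compiler without OpenMP support, while an OpenMP-enabled build parallelizes the loop and combines the per-thread sums via the reduction clause.

```cpp
#include <cassert>
#include <vector>

// A dot product where one directive expresses the parallelism: each
// thread accumulates a private partial sum, and reduction(+:sum)
// combines them. Without -fopenmp the pragma is ignored and the loop
// runs serially, producing the same result.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)a.size(); ++i)
        sum += a[i] * b[i];
    return sum;
}
```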
Intel Xeon Phi Processor High Performance Programming: Knights Landing Edition 2nd Edition
This book is an all-in-one source of information for programming the Second-Generation Intel Xeon Phi product family also called Knights Landing. The authors provide detailed and timely Knights
The OpenCL specification
  • A. Munshi
  • Computer Science
    2009 IEEE Hot Chips 21 Symposium (HCS)
  • 2009
The specification is divided into a core specification that any OpenCL-compliant implementation must support; a handheld/embedded profile, which relaxes the OpenCL compliance requirements for handheld and embedded devices; and a set of optional extensions that are likely to move into the core specification in later revisions of the OpenCL specification.
PIConGPU: A Fully Relativistic Particle-in-Cell Code for a GPU Cluster
The simulation code PIConGPU presented in this paper is, to the authors' knowledge, the first scalable GPU cluster implementation of the PIC algorithm in plasma physics.
Radiative signature of the relativistic Kelvin-Helmholtz Instability
  • M. Bussmann, H. Burau, R. Widera
  • Physics
    2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
  • 2013
We present a particle-in-cell simulation of the relativistic Kelvin-Helmholtz Instability (KHI) that for the first time delivers angularly resolved radiation spectra of the particle dynamics during