Metrics and Design of an Instruction Roofline Model for AMD GPUs

Matthew Leinhauser, René Widera, Sergei Bastrakov, Alexander Debus, Michael Bussmann, Sunita Chandrasekaran
ACM Transactions on Parallel Computing, pages 1–14
Due to the recent announcement of the Frontier supercomputer, many scientific application developers are working to make their applications compatible with AMD (CPU–GPU) architectures, which means moving away from the traditional CPU and NVIDIA-GPU systems. Because of the current limitations of profiling tools for AMD GPUs, this shift leaves a void in how to measure application performance on AMD GPUs. In this article, we design an instruction roofline model for AMD GPUs using AMD's ROCProfiler and…
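As a rough illustration of the kind of metric such a model is built on, the sketch below computes instruction intensity (warp-level instructions per memory transaction) from two counter values. The counter names and values here are placeholders for illustration, not actual ROCProfiler metric names or measured data.

```python
def instruction_intensity(warp_instructions, memory_transactions):
    """Instruction intensity: warp-level instructions executed per
    memory transaction (a transaction is typically 32 bytes on GPUs)."""
    return warp_instructions / memory_transactions

# Hypothetical counter values for a kernel (not real profiler output):
insts = 4.0e9   # warp-level instructions executed
txns = 1.0e9    # memory transactions issued
print(instruction_intensity(insts, txns))  # → 4.0 instructions/transaction
```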

Evaluating Performance Portability of OpenMP for SNAP on NVIDIA, Intel, and AMD GPUs Using the Roofline Methodology
In this paper, we show that an OpenMP 4.5-based implementation of TestSNAP, a proxy-app for the Spectral Neighbor Analysis Potential (SNAP) in LAMMPS, can be ported across the NVIDIA, Intel, and AMD…
An Instruction Roofline Model for GPUs
  • Nan Ding, Samuel Williams
  • Computer Science
    2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)
  • 2019
The Instruction Roofline incorporates instructions and memory transactions across all memory hierarchies together and provides more performance insights than the FLOP-oriented Roofline Model, i.e., instruction throughput, strided memory access patterns, bank conflicts, and thread predication.
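The ceiling the Instruction Roofline imposes can be sketched as the minimum of peak warp instruction throughput and the product of transaction bandwidth and instruction intensity. The machine parameters below are illustrative placeholders, not vendor specifications.

```python
def attainable_gips(peak_gips, txn_bandwidth_gtxn_s, intensity):
    """Attainable warp instruction throughput (GIPS) under the
    Instruction Roofline: bound either by the peak issue rate or by
    how fast memory transactions can be served."""
    return min(peak_gips, txn_bandwidth_gtxn_s * intensity)

# Illustrative machine parameters (assumed for this sketch):
peak = 489.6   # peak warp GIPS
bw = 25.9      # memory transaction bandwidth in Gtxn/s
for intensity in (1, 10, 100):  # instructions per transaction
    print(intensity, attainable_gips(peak, bw, intensity))
```

At low intensity the kernel sits on the slanted memory-bound part of the roofline; once `bw * intensity` exceeds `peak`, it hits the flat compute ceiling.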
A Quantitative Performance Evaluation of Fast on-Chip Memories of GPUs
  • E. Konstantinidis, Y. Cotronis
  • Computer Science
    2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)
  • 2016
A set of micro-benchmarks which aim to provide effective bandwidth performance measurements of the on-chip special memories of GPUs and validate the peak measurements on real world problems as provided by the polybench-gpu benchmark suite.
Performance-Portable Many-Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond
This work demonstrates how the CUDA-based open-source plasma simulation code PIConGPU can benefit from the tunable kernel execution strategies of the Alpaka library, achieving portability and performance with single-source kernels on conventional CPUs, Power8 CPUs and NVIDIA GPUs.
Performance Analysis of PIConGPU: Particle-in-Cell on GPUs using NVIDIA’s NSight Systems and NSight Compute
The PIConGPU team wanted to dive deep into the application to understand, at the finest granularity, which portions of the code could be further optimized to exploit the hardware on Summit to its maximum potential, and to elucidate which key kernels should be tracked and optimized for the CAAR effort to port this code to Frontier.
GPU-STREAM v2.0: Benchmarking the Achievable Memory Bandwidth of Many-Core Processors Across Diverse Parallel Programming Models
Many scientific codes consist of memory-bandwidth-bound kernels: the dominating factor in their runtime is the speed at which data can be loaded from memory into the Arithmetic Logic Units, before…
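A minimal host-side version of the STREAM triad illustrates the kind of bandwidth-bound kernel described above. This is a NumPy sketch for illustration only, not the GPU-STREAM implementation, and the array size is an arbitrary assumption.

```python
import time
import numpy as np

def triad_bandwidth(n=10_000_000, scalar=3.0):
    """STREAM triad a = b + scalar * c; each element touches three
    8-byte doubles (read b, read c, write a), so bytes = 3 * 8 * n."""
    b = np.ones(n)
    c = np.ones(n)
    t0 = time.perf_counter()
    a = b + scalar * c  # the bandwidth-bound kernel
    t1 = time.perf_counter()
    gbytes = 3 * 8 * n / 1e9
    return gbytes / (t1 - t0)  # effective bandwidth in GB/s

print(f"{triad_bandwidth():.1f} GB/s")
```

The arithmetic per element (one multiply, one add) is trivial next to the 24 bytes moved, so the measured rate tracks memory bandwidth rather than compute throughput.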
Tuning and Optimization for a Variety of Many-Core Architectures Without Changing a Single Line of Implementation Code Using the Alpaka Library
The general matrix multiplication (GEMM) algorithm is used in this example to demonstrate that Alpaka allows for platform-specific tuning with a single source code, and to show the optimization potential available with vendor-specific compilers when confronted with the heavily templated abstractions of Alpaka.
Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC‐9 Perlmutter system
This work presents a methodology to construct a hierarchical Roofline on NVIDIA GPUs, extends it to support reduced precision and Tensor Cores, and analyzes three proxy applications: GPP from BerkeleyGW, HPGMG from AMReX, and conv2d from TensorFlow.
Cache-aware Roofline model: Upgrading the loft
This paper analyzes the original Roofline model and proposes a novel approach to provide a more insightful performance modeling of modern architectures by introducing cache-awareness, thus significantly improving the guidelines for application optimization.
Circumventing the Dephasing and Depletion Limits of Laser-Wakefield Acceleration
A. Debus, R. Pausch, A. Huebl, K. Steiniger, R. Widera, T. E. Cowan, U. Schramm, M. Bussmann. Published by the American Physical Society, 2019. Compact electron…