Corpus ID: 239009478

Metrics and Design of an Instruction Roofline Model for AMD GPUs

@article{Leinhauser2021MetricsAD,
  title={Metrics and Design of an Instruction Roofline Model for AMD GPUs},
  author={Matthew Leinhauser and Ren{\'e} Widera and Sergei Bastrakov and Alexander Debus and Michael Bussmann and Sunita Chandrasekaran},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.08221}
}
MATTHEW LEINHAUSER, Center for Advanced Systems Understanding, Germany and University of Delaware, USA
RENÉ WIDERA, Helmholtz-Zentrum Dresden-Rossendorf Laboratory, Germany
SERGEI BASTRAKOV, Helmholtz-Zentrum Dresden-Rossendorf Laboratory, Germany
ALEXANDER DEBUS, Helmholtz-Zentrum Dresden-Rossendorf Laboratory, Germany
MICHAEL BUSSMANN, Center for Advanced Systems Understanding, Germany and Helmholtz-Zentrum Dresden-Rossendorf Laboratory, Germany
SUNITA CHANDRASEKARAN, University of Delaware…

References

An Instruction Roofline Model for GPUs
  • Nan Ding, Samuel Williams
  • Computer Science
    2019 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS)
  • 2019
TLDR
The Instruction Roofline incorporates instructions and memory transactions across all levels of the memory hierarchy and provides more performance insights than the FLOP-oriented Roofline Model, including instruction throughput, strided memory access patterns, bank conflicts, and thread predication.
Hierarchical Roofline analysis for GPUs: Accelerating performance optimization for the NERSC‐9 Perlmutter system
TLDR
A methodology for constructing a hierarchical Roofline on NVIDIA GPUs, extended to support reduced precision and Tensor Cores, and applied to three proxy applications: GPP from BerkeleyGW, HPGMG from AMReX, and conv2d from TensorFlow.
Evaluating Performance Portability of OpenMP for SNAP on NVIDIA, Intel, and AMD GPUs Using the Roofline Methodology
In this paper, we show that an OpenMP 4.5-based implementation of TestSNAP, a proxy-app for the Spectral Neighbor Analysis Potential (SNAP) in LAMMPS, can be ported across the NVIDIA, Intel, and AMD
GPU-STREAM v2.0: Benchmarking the Achievable Memory Bandwidth of Many-Core Processors Across Diverse Parallel Programming Models
Many scientific codes consist of memory bandwidth bound kernels — the dominating factor of the runtime is the speed at which data can be loaded from memory into the Arithmetic Logic Units, before
A Quantitative Performance Evaluation of Fast on-Chip Memories of GPUs
  • E. Konstantinidis, Y. Cotronis
  • Computer Science
    2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)
  • 2016
TLDR
A set of micro-benchmarks which aim to provide effective bandwidth performance measurements of the on-chip special memories of GPUs and validate the peak measurements on real world problems as provided by the polybench-gpu benchmark suite.
Performance-Portable Many-Core Plasma Simulations: Porting PIConGPU to OpenPower and Beyond
TLDR
This work demonstrates how the CUDA-based open-source plasma simulation code PIConGPU can benefit from the tunable kernel execution strategies of the Alpaka library, achieving portability and performance with single-source kernels on conventional CPUs, Power8 CPUs and NVIDIA GPUs.
Tuning and Optimization for a Variety of Many-Core Architectures Without Changing a Single Line of Implementation Code Using the Alpaka Library
TLDR
The general matrix multiplication (GEMM) algorithm is used in this example to prove that Alpaka allows for platform-specific tuning with a single source code, and to analyze the optimization potential available with vendor-specific compilers when confronted with the heavily templated abstractions of Alpaka.
Roofline: an insightful visual performance model for multicore architectures
TLDR
The Roofline model offers insight on how to improve the performance of software and hardware in the rapidly changing world of connected devices.
Performance Analysis of PIConGPU: Particle-in-Cell on GPUs using NVIDIA’s NSight Systems and NSight Compute
PIConGPU, Particle In Cell on GPUs, is an open source simulations framework for plasma and laser-plasma physics used to develop advanced particle accelerators for radiation therapy of cancer, high
Cache-aware Roofline model: Upgrading the loft
TLDR
This paper analyzes the original Roofline model and proposes a novel approach to provide a more insightful performance modeling of modern architectures by introducing cache-awareness, thus significantly improving the guidelines for application optimization.