Toward Performance-Portable PETSc for GPU-based Exascale Systems

@article{Mills2021TowardPP,
  title={Toward Performance-Portable PETSc for GPU-based Exascale Systems},
  author={Richard Tran Mills and Mark F. Adams and Satish Balay and Jed Brown and Alp Dener and Matthew G. Knepley and Scott E. Kruger and Hannah Morgan and Todd Munson and Karl Rupp and Barry F. Smith and Stefano Zampini and Hong Zhang and Junchao Zhang},
  journal={Parallel Comput.},
  year={2021},
  volume={108},
  pages={102831}
}
The Portable, Extensible Toolkit for Scientific Computation (PETSc) library delivers scalable solvers for nonlinear time-dependent differential and algebraic equations and for numerical optimization. The PETSc design for performance portability addresses fundamental GPU accelerator challenges and stresses flexibility and extensibility by separating the programming model used by the application from that used by the library, enabling application developers to use their preferred programming…
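As a minimal sketch of this separation (not from the paper): the same application code runs on CPU or GPU, with the back-end chosen at runtime via the standard -vec_type option (e.g. cuda, hip, kokkos). The sketch assumes a PETSc build, version 3.17 or later for PetscCall, configured with the corresponding GPU support.

/* Back-end-agnostic PETSc code: the vector type is selected at runtime,
   e.g.  ./app -vec_type cuda   or   ./app -vec_type kokkos  */
#include <petscvec.h>

int main(int argc, char **argv) {
  Vec       x;
  PetscReal norm;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCall(VecCreate(PETSC_COMM_WORLD, &x));
  PetscCall(VecSetSizes(x, PETSC_DECIDE, 1000000));
  PetscCall(VecSetFromOptions(x));      /* honors -vec_type cuda, hip, kokkos, ... */
  PetscCall(VecSet(x, 1.0));
  PetscCall(VecNorm(x, NORM_2, &norm)); /* executes on the selected back-end */
  PetscCall(PetscPrintf(PETSC_COMM_WORLD, "norm = %g\n", (double)norm));
  PetscCall(VecDestroy(&x));
  PetscCall(PetscFinalize());
  return 0;
}

Because the type is bound at runtime rather than compile time, the application never names a GPU programming model in its own source, which is the flexibility the abstract describes.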
The PetscSF Scalable Communication Layer
TLDR
The design of PetscSF is discussed, showing how it can overcome some of the difficulties of working directly with MPI on GPUs, and its performance, scalability, and novel features are demonstrated.
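A hedged sketch of the PetscSF usage pattern (illustrative, not from the paper): each rank exposes one root value, and each rank's single leaf references the root owned by rank 0, so a broadcast over the star forest replicates rank 0's value everywhere. It assumes PETSc 3.17 or later (PetscCall, and the MPI_Op argument added to PetscSFBcastBegin/End in 3.15); with a GPU-enabled build, the same calls can operate on device buffers.

/* Star forest: 1 root per rank, 1 leaf per rank pointing at (rank 0, index 0) */
#include <petscsf.h>

int main(int argc, char **argv) {
  PetscSF     sf;
  PetscSFNode remote;
  PetscMPIInt rank;
  PetscInt    rootdata, leafdata = -1;

  PetscCall(PetscInitialize(&argc, &argv, NULL, NULL));
  PetscCallMPI(MPI_Comm_rank(PETSC_COMM_WORLD, &rank));
  rootdata     = 100 + rank; /* the value this rank owns */
  remote.rank  = 0;          /* every leaf references rank 0 ... */
  remote.index = 0;          /* ... root index 0 */
  PetscCall(PetscSFCreate(PETSC_COMM_WORLD, &sf));
  PetscCall(PetscSFSetGraph(sf, 1, 1, NULL, PETSC_COPY_VALUES, &remote, PETSC_COPY_VALUES));
  PetscCall(PetscSFBcastBegin(sf, MPIU_INT, &rootdata, &leafdata, MPI_REPLACE));
  PetscCall(PetscSFBcastEnd(sf, MPIU_INT, &rootdata, &leafdata, MPI_REPLACE));
  PetscCall(PetscPrintf(PETSC_COMM_SELF, "[%d] leaf = %d\n", rank, (int)leafdata)); /* 100 on all ranks */
  PetscCall(PetscSFDestroy(&sf));
  PetscCall(PetscFinalize());
  return 0;
}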
H2Opus: A distributed-memory multi-GPU software package for non-local operators
TLDR
This paper presents high-performance, distributed-memory GPU-accelerated algorithms and implementations for matrix-vector multiplication and matrix recompression of hierarchical matrices in the H2 format and demonstrates scalability up to 16M degrees of freedom problems on 64 GPUs.
The PETSc Community Is the Infrastructure
TLDR
A case study of the PETSc (Portable Extensible Toolkit for Scientific Computation) community, its organization, and technical approaches that enable community members to help each other efficiently are presented.
libEnsemble: A Library to Coordinate the Concurrent Evaluation of Dynamic Ensembles of Calculations
Almost all applications stop scaling at some point; those that don't are seldom performant when considering time to solution on anything but aspirational/unicorn resources. Recognizing these…

References

SHOWING 1-10 OF 47 REFERENCES
PETSc Users Manual
The Portable, Extensible Toolkit for Scientific Computation (PETSc) is a suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations.
The PetscSF Scalable Communication Layer
TLDR
The design of PetscSF is discussed, showing how it can overcome some of the difficulties of working directly with MPI on GPUs, and its performance, scalability, and novel features are demonstrated.
RAJA: Portable Performance for Large-Scale Scientific Applications
TLDR
RAJA is described, a portability layer that enables C++ applications to leverage various programming models, and thus architectures, with a single-source codebase, and preliminary results using RAJA are described.
Preliminary Implementation of PETSc Using GPUs
TLDR
A new subclass of the vector class that performs its operations on NVIDIA GPUs is introduced, along with a new sparse matrix subclass that performs matrix-vector products on the GPU.
Preparing sparse solvers for exascale computing
TLDR
The challenges, strategies, and progress of the US Department of Energy Exascale Computing Project toward providing sparse solvers for exascale computing platforms are described, addressing the demands of systems with thousands of high-performance node devices, where exposing concurrency, hiding latency, and creating alternative algorithms become essential.
ViennaCL - Linear Algebra Library for Multi- and Many-Core Architectures
TLDR
This work presents the linear algebra library ViennaCL, which is built on top of all three programming models (CUDA, OpenCL, and OpenMP), thus enabling computational scientists to interface to a single library yet obtain high performance for all three hardware types.
Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers
TLDR
An investigation is presented showing that other high-performance computing (HPC) applications can also harness the power of this floating-point arithmetic, and that using half-precision Tensor Cores (FP16-TC) for the arithmetic can provide up to a 4× speedup.
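To make the mixed-precision iterative-refinement idea concrete, here is a hedged, self-contained sketch in which float stands in for the FP16 Tensor-Core solve and double for the high-precision residual; the 2x2 system and all names are illustrative, not from the paper.

/* Iterative refinement: low-precision solve, high-precision residual.
   Converges to x = (1/11, 7/11) for this system. */
#include <stdio.h>

int main(void) {
  const double A[2][2] = {{4.0, 1.0}, {1.0, 3.0}};
  const double b[2]    = {1.0, 2.0};
  /* Low-precision "factorization" surrogate: an explicit float inverse of A */
  const float det = 4.0f * 3.0f - 1.0f * 1.0f;
  const float Ainv[2][2] = {{ 3.0f / det, -1.0f / det},
                            {-1.0f / det,  4.0f / det}};
  double x[2] = {0.0, 0.0};
  for (int k = 0; k < 4; ++k) {
    double r[2]; /* residual r = b - A x, computed in high precision */
    for (int i = 0; i < 2; ++i) r[i] = b[i] - A[i][0] * x[0] - A[i][1] * x[1];
    /* correction d = Ainv * r in low precision, accumulated into x in high */
    for (int i = 0; i < 2; ++i)
      x[i] += Ainv[i][0] * (float)r[0] + Ainv[i][1] * (float)r[1];
    printf("iter %d: x = (%.12f, %.12f)\n", k, x[0], x[1]);
  }
  return 0;
}

The cheap low-precision solve does most of the work; the double-precision residual steers each correction, which is why the refined solution recovers high-precision accuracy despite the fast low-precision arithmetic.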
An investigation of the performance portability of OpenCL
TLDR
The development of an MPI/OpenCL implementation of LU, an application-level benchmark from the NAS Parallel Benchmark Suite, is reported, demonstrating the importance of memory arrangement and work-item/work-group distribution strategies when applications are deployed on different device types.
Aluminum: An Asynchronous, GPU-Aware Communication Library Optimized for Large-Scale Training of Deep Neural Networks on HPC Systems
TLDR
Aluminum is implemented, enabling optimized, asynchronous, GPU-aware communication, and demonstrates improved performance in benchmarks and end-to-end training of deep networks, for both strong and weak scaling.
Kokkos: Enabling manycore performance portability through polymorphic memory access patterns
TLDR
Kokkos’ abstractions are described, its application programmer interface (API) is summarized, performance results for unit-test kernels and mini-applications are presented, and an incremental strategy for migrating legacy C++ codes to Kokkos is outlined.