Toward Performance-Portable PETSc for GPU-based Exascale Systems

Richard Tran Mills, Mark F. Adams, Satish Balay, Jed Brown, Alp Dener, Matthew G. Knepley, Scott E. Kruger, Hannah Morgan, Todd S. Munson, Karl Rupp, Barry F. Smith, Stefano Zampini, Hong Zhang, and Junchao Zhang

The PetscSF Scalable Communication Layer

This paper discusses the design of PetscSF, shows how it can overcome some difficulties of working directly with MPI on GPUs, and demonstrates its performance, scalability, and novel features.

A Performance Analysis of Modern Parallel Programming Models Using a Compute-Bound Application

The performance of the most promising modern parallel programming models is analyzed on a diverse range of contemporary high-performance hardware using a compute-bound molecular docking mini-app, showing that a higher-level framework such as SYCL can achieve OpenMP levels of performance while aiding productivity.

H2Opus: a distributed-memory multi-GPU software package for non-local operators

This paper presents high-performance, distributed-memory GPU-accelerated algorithms and implementations for matrix-vector multiplication and matrix recompression of hierarchical matrices in the ${\mathscr{H}}^{2}$ format, and demonstrates scalability up to problems with 16M degrees of freedom on 64 GPUs.

Parallel, Portable Algorithms for Distance-2 Maximal Independent Set and Graph Coarsening

The new MIS-2 implementation outperforms implementations in state-of-the-art libraries such as CUSP and ViennaCL by 3-8x while producing results of similar quality, and an approach for implementing a parallel multicolor "cluster" Gauss-Seidel preconditioner using this MIS-2 coarsening scheme is described.
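To illustrate the distance-2 maximal independent set concept named in the title, here is a minimal sequential greedy sketch (an assumed, simplified stand-in; the paper's contribution is a parallel, portable algorithm, and `mis2` and the example graph are hypothetical):

```python
# Sequential greedy distance-2 maximal independent set on an
# adjacency-list graph: no two selected vertices are within distance 2.
def mis2(adj):
    n = len(adj)
    selected, blocked = [], [False] * n
    for v in range(n):
        if blocked[v]:
            continue
        selected.append(v)
        blocked[v] = True
        for u in adj[v]:          # block distance-1 neighbors
            blocked[u] = True
            for w in adj[u]:      # block distance-2 neighbors
                blocked[w] = True
    return selected

# Path graph 0-1-2-3-4: vertices 0 and 3 form a distance-2 MIS.
adj = [[1], [0, 2], [1, 3], [2, 4], [3]]
print(mis2(adj))  # → [0, 3]
```

In graph coarsening, each selected vertex seeds an aggregate of its nearby vertices, which is why an MIS-2 (rather than MIS-1) selection yields well-separated aggregate centers.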

libEnsemble: A Library to Coordinate the Concurrent Evaluation of Dynamic Ensembles of Calculations

Almost all applications stop scaling at some point; those that don't are seldom performant when considering time to solution on anything but aspirational/unicorn resources.

PNODE: A memory-efficient neural ODE framework based on high-level adjoint differentiation

PNODE, a new neural ODE framework based on high-level discrete adjoint algorithmic differentiation, achieves the highest memory efficiency among reverse-accurate methods and enables the use of the implicit time integration methods needed for stiff dynamical systems.

Landau collision operator in the CUDA programming model applied to thermal quench plasmas

This paper extends previous work on a conservative, high-order accurate, finite element discretization with adaptive mesh refinement of the Landau operator, with extensions to GPU hardware and implementations in both the CUDA and Kokkos programming models.

Performance Portable Solid Mechanics via Matrix-Free p-Multigrid

This work uses performance models and numerical experiments to demonstrate that high-order methods greatly reduce the cost of reaching engineering tolerances while enabling effective use of GPUs. It also discusses efficient matrix-free representation of Jacobians and how automatic differentiation enables rapid development of nonlinear material models without impacting debuggability or workflows targeting GPUs.
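The matrix-free idea mentioned above can be sketched in a few lines: apply the operator's action directly rather than assembling and multiplying a sparse matrix. This is an assumed toy example (a 1D Laplacian stencil, not the paper's solid-mechanics operator), and `laplacian_apply` is a hypothetical name:

```python
import numpy as np

# Matrix-free application of the 1D Laplacian stencil [-1, 2, -1]:
# the matrix is never formed, only its action on a vector.
def laplacian_apply(u):
    v = 2.0 * u
    v[:-1] -= u[1:]   # subtract right neighbor
    v[1:] -= u[:-1]   # subtract left neighbor
    return v

u = np.array([1.0, 2.0, 3.0])
# The assembled tridiagonal matrix gives the same result:
A = np.array([[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]])
print(laplacian_apply(u))  # → [0. 0. 4.], identical to A @ u
```

The payoff on GPUs is that the stencil application needs only the solution vector, avoiding the memory traffic of streaming a stored sparse matrix, which is the bandwidth bottleneck for high-order assembled operators.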

Designing a Framework for Solving Multiobjective Simulation Optimization Problems

The design goals driving the development of the parallel MOSO library ParMOO were to provide a customizable MOSO framework that allows exploitation of simulation-based problem structure, ease of deployment in scientific workflows, maintainability, and flexibility in supporting many problem types.

The PETSc Community as Infrastructure

A case study of the PETSc (Portable Extensible Toolkit for Scientific Computation) community is presented, covering its organization and the technical approaches that enable community members to help each other efficiently.

PETSc Users Manual

The Portable, Extensible Toolkit for Scientific Computation (PETSc) is a suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations.

RAJA: Portable Performance for Large-Scale Scientific Applications

This work describes RAJA, a portability layer that enables C++ applications to target various programming models, and thus architectures, from a single-source codebase, and presents preliminary results obtained using RAJA.

Preliminary Implementation of PETSc Using GPUs

A new subclass of the vector class that performs its operations on NVIDIA GPUs is introduced, along with a new sparse matrix subclass that performs matrix-vector products on the GPU.
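The matrix-vector product that such a GPU matrix subclass offloads is typically a sparse kernel over a compressed row (CSR) layout. A minimal CPU sketch of that kernel (an assumed illustration; `csr_matvec` is a hypothetical name, not PETSc's API):

```python
import numpy as np

# Sparse matrix-vector product over the CSR layout (indptr, indices, data):
# for each row, dot the stored nonzeros with the gathered entries of x.
def csr_matvec(indptr, indices, data, x):
    y = np.zeros(len(indptr) - 1)
    for row in range(len(y)):
        start, end = indptr[row], indptr[row + 1]
        y[row] = data[start:end] @ x[indices[start:end]]
    return y

# The 2x2 matrix [[4, 1], [0, 3]] in CSR form:
indptr  = np.array([0, 2, 3])
indices = np.array([0, 1, 1])
data    = np.array([4.0, 1.0, 3.0])
print(csr_matvec(indptr, indices, data, np.array([1.0, 2.0])))  # → [6. 6.]
```

On a GPU, the per-row loop becomes the parallel dimension, with one thread (or warp) per row, which is why the row-oriented CSR layout maps naturally onto the hardware.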

Preparing sparse solvers for exascale computing

The challenges, strategies and progress of the US Department of Energy Exascale Computing project towards providing sparse solvers for exascale computing platforms are described, addressing the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential.

ViennaCL - Linear Algebra Library for Multi- and Many-Core Architectures

This work presents the linear algebra library ViennaCL, which is built on top of all three programming models, thus enabling computational scientists to interface to a single library, yet obtain high performance for all three hardware types.

Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers

This investigation shows that other high-performance computing (HPC) applications can also harness the power of fast reduced-precision floating-point arithmetic, demonstrating that half-precision Tensor Cores (FP16-TC) can provide up to a 4× speedup.
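The iterative refinement scheme behind this speedup can be sketched briefly: solve cheaply in low precision, then correct the residual in high precision until it converges. This is a hedged illustration using float32 as a stand-in for FP16 Tensor Cores, with a dense NumPy solve in place of the paper's GPU factorization; `mixed_precision_solve` is a hypothetical name:

```python
import numpy as np

# Mixed-precision iterative refinement: the expensive solve runs in low
# precision, while residuals and corrections accumulate in float64.
def mixed_precision_solve(A, b, iters=10):
    A32 = A.astype(np.float32)                 # low-precision operator
    x = np.linalg.solve(A32, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                          # residual in double precision
        d = np.linalg.solve(A32, r.astype(np.float32))  # cheap correction
        x += d.astype(np.float64)
    return x

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 50)) + 50.0 * np.eye(50)  # well-conditioned
b = rng.standard_normal(50)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(A @ x - b))  # residual near double-precision accuracy
```

The scheme converges to double-precision accuracy as long as the low-precision solve is accurate enough to contract the error, which is why it works best for reasonably well-conditioned systems.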

Aluminum: An Asynchronous, GPU-Aware Communication Library Optimized for Large-Scale Training of Deep Neural Networks on HPC Systems

Aluminum is implemented, enabling optimized, asynchronous, GPU-aware communication, and demonstrates improved performance in benchmarks and end-to-end training of deep networks, for both strong and weak scaling.

Kokkos: Enabling manycore performance portability through polymorphic memory access patterns