Toward Performance-Portable PETSc for GPU-based Exascale Systems
@article{Mills2020TowardPP,
  title={Toward Performance-Portable PETSc for GPU-based Exascale Systems},
  author={Richard Tran Mills and Mark F. Adams and Satish Balay and Jed Brown and Alp Dener and Matthew G. Knepley and Scott E. Kruger and Hannah Morgan and Todd S. Munson and Karl Rupp and Barry F. Smith and Stefano Zampini and Hong Zhang and Junchao Zhang},
  journal={ArXiv},
  year={2020},
  volume={abs/2011.00715}
}
14 Citations
The PetscSF Scalable Communication Layer
- Computer Science · IEEE Transactions on Parallel and Distributed Systems
- 2022
The design of PetscSF is discussed, showing how it can overcome some difficulties of working directly with MPI on GPUs; its performance, scalability, and novel features are demonstrated.
A Performance Analysis of Modern Parallel Programming Models Using a Compute-Bound Application
- Computer Science
- 2021
The performance of the most promising modern parallel programming models is analyzed on a diverse range of contemporary high-performance hardware using a compute-bound molecular docking mini-app, showing that a higher-level framework such as SYCL can achieve OpenMP levels of performance while aiding productivity.
H2Opus: a distributed-memory multi-GPU software package for non-local operators
- Computer Science · Adv. Comput. Math.
- 2022
This paper presents high-performance, distributed-memory, GPU-accelerated algorithms and implementations for matrix-vector multiplication and matrix recompression of hierarchical matrices in the $\mathscr{H}^2$ format, and demonstrates scalability up to problems with 16M degrees of freedom on 64 GPUs.
Parallel, Portable Algorithms for Distance-2 Maximal Independent Set and Graph Coarsening
- Computer Science · 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- 2022
The new MIS-2 implementation outperforms implementations in state-of-the-art libraries such as CUSP and ViennaCL by 3-8x while producing similar-quality results, and an approach for implementing a parallel multicolor “cluster” Gauss-Seidel preconditioner using this MIS-2 coarsening scheme is described.
libEnsemble: A Library to Coordinate the Concurrent Evaluation of Dynamic Ensembles of Calculations
- Computer Science · IEEE Transactions on Parallel and Distributed Systems
- 2022
Almost all applications stop scaling at some point; those that don't are seldom performant when considering time to solution on anything but aspirational/unicorn resources. Recognizing these…
PNODE: A memory-efficient neural ODE framework based on high-level adjoint differentiation
- Computer Science · ArXiv
- 2022
PNODE, a new neural ODE framework based on high-level discrete adjoint algorithmic differentiation, achieves the highest memory efficiency when compared with other reverse-accurate methods and enables the use of the implicit time integration methods needed for stiff dynamical systems.
Landau collision operator in the CUDA programming model applied to thermal quench plasmas
- Computer Science, Physics · 2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- 2022
This paper extends previous work on a conservative, high-order accurate finite element discretization of the Landau operator with adaptive mesh refinement, adding extensions to GPU hardware and implementations in both the CUDA and Kokkos programming models.
Performance Portable Solid Mechanics via Matrix-Free p-Multigrid
- Computer Science · ArXiv
- 2022
This work uses performance models and numerical experiments to demonstrate that high-order methods greatly reduce the cost of reaching engineering tolerances while enabling effective use of GPUs, and discusses efficient matrix-free representation of Jacobians and how automatic differentiation enables rapid development of nonlinear material models without impacting debuggability or workflows targeting GPUs.
Designing a Framework for Solving Multiobjective Simulation Optimization Problems
- Computer Science · ArXiv
- 2023
The design goals driving the development of the parallel MOSO library ParMOO were to provide a customizable MOSO framework that allows exploitation of simulation-based problem structures, ease of deployment in scientific workflows, maintainability, and flexibility in supporting many problem types.
The PETSc Community as Infrastructure
- Computer Science · Computing in Science &amp; Engineering
- 2022
A case study of the PETSc (Portable, Extensible Toolkit for Scientific Computation) community is presented, covering its organization and the technical approaches that enable community members to help each other efficiently.
28 References
PETSc Users Manual
- Computer Science
- 2019
The Portable, Extensible Toolkit for Scientific Computation (PETSc) is a suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations.
The PetscSF Scalable Communication Layer
- Computer Science · IEEE Transactions on Parallel and Distributed Systems
- 2022
The design of PetscSF is discussed, showing how it can overcome some difficulties of working directly with MPI on GPUs; its performance, scalability, and novel features are demonstrated.
RAJA: Portable Performance for Large-Scale Scientific Applications
- Computer Science · 2019 IEEE/ACM International Workshop on Performance, Portability and Productivity in HPC (P3HPC)
- 2019
RAJA, a portability layer that enables C++ applications to leverage various programming models, and thus architectures, from a single-source codebase, is described, along with preliminary results from its use.
Preliminary Implementation of PETSc Using GPUs
- Computer Science
- 2013
A new subclass of the vector class that performs its operations on NVIDIA GPUs is introduced, along with a new sparse matrix subclass that performs matrix-vector products on the GPU.
Preparing sparse solvers for exascale computing
- Computer Science · Philosophical Transactions of the Royal Society A
- 2020
The challenges, strategies and progress of the US Department of Energy Exascale Computing project towards providing sparse solvers for exascale computing platforms are described, addressing the demands of systems with thousands of high-performance node devices where exposing concurrency, hiding latency and creating alternative algorithms become essential.
ViennaCL - Linear Algebra Library for Multi- and Many-Core Architectures
- Computer Science · SIAM J. Sci. Comput.
- 2016
This work presents the linear algebra library ViennaCL, which is built on top of three programming models (CUDA, OpenCL, and OpenMP), enabling computational scientists to interface with a single library yet obtain high performance on all three hardware types.
Harnessing GPU Tensor Cores for Fast FP16 Arithmetic to Speed up Mixed-Precision Iterative Refinement Solvers
- Computer Science · SC18: International Conference for High Performance Computing, Networking, Storage and Analysis
- 2018
This work shows that other high-performance computing (HPC) applications can also harness this floating-point arithmetic capability, demonstrating that using half-precision Tensor Cores (FP16-TC) for the arithmetic can provide up to a 4× speedup.
An investigation of the performance portability of OpenCL
- Computer Science · J. Parallel Distributed Comput.
- 2013
Aluminum: An Asynchronous, GPU-Aware Communication Library Optimized for Large-Scale Training of Deep Neural Networks on HPC Systems
- Computer Science · 2018 IEEE/ACM Machine Learning in HPC Environments (MLHPC)
- 2018
Aluminum implements optimized, asynchronous, GPU-aware communication and demonstrates improved performance in benchmarks and in end-to-end training of deep networks, for both strong and weak scaling.
Kokkos: Enabling manycore performance portability through polymorphic memory access patterns
- Computer Science · J. Parallel Distributed Comput.
- 2014