Enhancing data locality of the conjugate gradient method for high-order matrix-free finite-element implementations

Martin Kronbichler, Dmytro Sashko, Peter Munch
This work investigates a variant of the conjugate gradient (CG) method and embeds it into the context of high-order finite-element schemes with fast matrix-free operator evaluation and cheap preconditioners like the matrix diagonal. Relying on a data-dependency analysis and appropriate enumeration of degrees of freedom, we interleave the vector updates and inner products in a CG iteration with the matrix-vector product with only minor organizational overhead. As a result, around 90% of the… 

Stage-parallel fully implicit Runge-Kutta implementations with optimal multilevel preconditioners at the scaling limit

We present an implementation of a fully stage-parallel preconditioner for Radau IIA type fully implicit Runge–Kutta methods, which approximates the inverse of A Q from the Butcher tableau by the

The deal.II library, Version 9.4

An overview of the new features of the finite element library deal.II, version 9.4, is provided.



Efficient High-Order Discontinuous Galerkin Finite Elements with Matrix-Free Implementations

This work proposes an element-based shared-memory parallelization option, compares it to a well-established shared-memory parallelization with global face data, and shows that merging the more arithmetically heavy operator evaluation with vector operations in application code can more than double efficiency on the latest hardware.

A generic interface for parallel cell-based finite element operator application

Efficient Nonlinear Solvers for Nodal High-Order Finite Elements in 3D

This work presents a method in which the action of the Jacobian is applied matrix-free exploiting a tensor product basis on hexahedral elements, while much sparser matrices based on Q1 sub-elements on the nodes of the high-order basis are assembled for preconditioning.

Hiding global synchronization latency in the preconditioned Conjugate Gradient algorithm

Fast Matrix-Free Evaluation of Discontinuous Galerkin Finite Element Operators

We present an algorithmic framework for matrix-free evaluation of discontinuous Galerkin finite element operators. It relies on fast quadrature with sum factorization on quadrilateral and hexahedral

Multigrid for Matrix-Free High-Order Finite Element Computations on Graphics Processors

A GPU parallelization of a matrix-free geometric multigrid iterative solver is developed, targeting moderate and high polynomial degrees, with support for general curved and adaptively refined hexahedral meshes with hanging nodes.

Matrix-free finite-element computations on graphics processors with adaptively refined unstructured meshes

This work presents a GPU parallelization of the matrix-free method, including a novel algorithm for resolving hanging-node constraints on the GPU; it is capable of simulation on adaptively refined grids and can solve problems 8 times larger in 3D.

hyper.deal: An Efficient, Matrix-free Finite-element Library for High-dimensional Partial Differential Equations

This work presents the efficient, matrix-free finite-element library hyper.deal for solving partial differential equations in two to six dimensions with high-order discontinuous Galerkin methods and reports results for high-dimensional advection problems and for the solution of the Vlasov–Poisson equation in up to 6D phase space.

A stencil scaling approach for accelerating matrix-free finite element implementations

We present a novel approach to fast on-the-fly low order finite element assembly for scalar elliptic partial differential equations of Darcy type with variable coefficients optimized for matrix-free

The Communication-Hiding Conjugate Gradient Method with Deep Pipelines

This work extends the pipelined CG method to deeper pipelines, which allows further scaling when the global communication phase is the dominant time-consuming factor by hiding communication latency behind computational work.