Accelerating solutions of PDEs with GPU-based swept time-space decomposition

@article{Magee2017AcceleratingSO,
  title={Accelerating solutions of {PDEs} with {GPU}-based swept time-space decomposition},
  author={Magee, Daniel J. and Niemeyer, Kyle E.},
  journal={arXiv preprint arXiv:1705.03162},
  year={2017}
}

Applying the swept rule for solving explicit partial differential equations on heterogeneous computing systems

This work extends the swept rule to heterogeneous CPU/GPU architectures representative of current and future HPC systems, showing the potential effectiveness of the swept rule for different equations and numerical schemes on massively parallel compute systems that incur substantial latency costs.

An efficient GPU-based fractional-step domain decomposition scheme for the reaction-diffusion equation

The method effectively accelerates the solution and outperforms previous methods in computational time; the new prediction and correction schemes presented in this study preserve the accuracy and stability of the solver even for a large number of sub-domains.

Parallel Numerical Solution of a 2D Chemotaxis-Stokes System on GPUs Technology

Numerical evidence demonstrates the effectiveness of the approach and the coherence of the results with the modeled phenomenon, together with a performance analysis of the serial and parallel implementations.

Applying the Swept Rule for Solving Two-Dimensional Partial Differential Equations on Heterogeneous Architectures

The swept rule offers the potential for both speedups and slowdowns, so care should be taken when designing such a solver to maximize its benefits; these results can inform design decisions toward that end.

References

The swept rule for breaking the latency barrier in time advancing two-dimensional PDEs

A method to accelerate parallel, explicit time integration of two-dimensional unsteady PDEs by decomposing space and time among computing nodes in ways that exploit the domains of influence and dependence, effectively breaking the latency barrier.
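The core idea behind the swept rule can be sketched in one dimension: from a local block of data, a node advances as many time steps as possible before communicating, and each explicit update shrinks the region of valid values by the stencil radius, tracing out the domain-of-dependence triangle. The sketch below is illustrative only; the function name, the heat-equation stencil, and all parameters are assumptions, not taken from the paper's code.

```python
import numpy as np

def swept_triangle(u, nsteps, alpha=0.25):
    """Advance a local 1D block several explicit time steps with no communication.

    Assumes a 3-point heat-equation stencil (illustrative choice). Each update
    needs one neighbor on each side, so the valid region shrinks by one point
    per step -- the "domain of dependence" triangle the swept rule exploits.
    """
    levels = [u.copy()]
    for _ in range(nsteps):
        v = levels[-1]
        # Interior update only; the endpoints would need data from
        # neighboring blocks, which is exactly what is deferred.
        levels.append(v[1:-1] + alpha * (v[2:] - 2.0 * v[1:-1] + v[:-2]))
    return levels  # levels[k] holds len(u) - 2*k points

u0 = np.linspace(0.0, 1.0, 11)
tri = swept_triangle(u0, 3)
print([len(v) for v in tri])  # [11, 9, 7, 5]
```

After building such a triangle, neighboring nodes exchange the edge values once and fill in the inverted triangles between them, amortizing one communication over many sub-steps.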

Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures

  • K. Datta, M. Murphy, K. Yelick
  • Computer Science
  • 2008 SC - International Conference for High Performance Computing, Networking, Storage and Analysis
This work explores multicore stencil (nearest-neighbor) computations - a class of algorithms at the heart of many structured grid codes, including PDE solvers - develops a number of effective optimization strategies, and builds an auto-tuning environment that searches over these optimizations and their parameters to minimize runtime while maximizing performance portability.
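The auto-tuning idea described above can be sketched minimally: time a blocked stencil pass for several candidate block sizes and keep the fastest. This is a hypothetical illustration; the function names, the 1D averaging stencil, and the candidate sizes are assumptions, not the paper's actual tuner.

```python
import time
import numpy as np

def stencil_pass(u, block):
    """One blocked pass of an illustrative 3-point averaging stencil."""
    out = u.copy()
    n = len(u)
    for start in range(1, n - 1, block):
        stop = min(start + block, n - 1)
        # Nearest-neighbor update over one cache block of interior points.
        out[start:stop] = 0.5 * u[start:stop] + 0.25 * (u[start - 1:stop - 1] + u[start + 1:stop + 1])
    return out

def autotune(u, candidates):
    """Search over block sizes and return the fastest one observed."""
    best, best_t = None, float("inf")
    for b in candidates:
        t0 = time.perf_counter()
        stencil_pass(u, b)
        dt = time.perf_counter() - t0
        if dt < best_t:
            best, best_t = b, dt
    return best

u = np.random.rand(1 << 16)
print(autotune(u, [64, 256, 1024, 4096]))
```

Real auto-tuners such as the one in this work search a much larger space (blocking in multiple dimensions, prefetching, SIMD variants) and typically cache the winning configuration per machine.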

Solving the compressible Navier-Stokes equations on up to 1.97 million cores and 4.1 trillion grid points

The use of hyperthreading, which significantly improves parallel performance, is discussed for Hybrid, a finite-difference solver of the compressible Navier-Stokes equations on structured grids used for the direct numerical simulation of isotropic turbulence and its interaction with shock waves.

How to obtain efficient GPU kernels: An illustration using FMM & FGT algorithms

Parallel time integration with multigrid

The resulting multigrid-reduction-in-time (MGRIT) algorithms are non-intrusive approaches that directly use an existing time propagator and thus can easily exploit substantially more computational resources than standard sequential time-stepping.

50 Years of Time Parallel Time Integration

This chapter is for people who want to quickly gain an overview of the exciting and rapidly developing area of research of time parallel methods.

Multicore-Optimized Wavefront Diamond Blocking for Optimizing Stencil Updates

This work combines the ideas of multicore wavefront temporal blocking and diamond tiling to arrive at stencil update schemes that show large reductions in memory pressure compared to existing approaches, and provides a controllable trade-off between concurrency and memory usage.

Communication-Avoiding QR Decomposition for GPUs

We describe an implementation of the Communication-Avoiding QR (CAQR) factorization that runs entirely on a single graphics processor (GPU). We show that the reduction in memory traffic provided by […]