Twinned buffering: A simple and highly effective scheme for parallelization of Successive Over-Relaxation on GPUs and other accelerators

@article{Vanderbauwhede2015TwinnedBA,
  title={Twinned buffering: A simple and highly effective scheme for parallelization of Successive Over-Relaxation on GPUs and other accelerators},
  author={Wim Vanderbauwhede and Tetsuya Takemi},
  journal={2015 International Conference on High Performance Computing \& Simulation (HPCS)},
  year={2015},
  pages={436--443}
}
  • W. Vanderbauwhede, T. Takemi
  • Published 20 July 2015
  • Computer Science
  • 2015 International Conference on High Performance Computing & Simulation (HPCS)
In this paper we present a new scheme for parallelization of the Successive Over-Relaxation method for solving the Poisson equation over a 3-D volume. Our new scheme is both simple and effective, outperforming the conventional Red-Black scheme by a factor of 16 on an NVIDIA GeForce GTX 590 GPU, a factor of 11 on an NVIDIA GeForce TITAN Black GPU and a factor of 5 on an Intel Xeon Phi. The speed-up compared to the fully optimised reference implementation running on an Intel Xeon CPU is 16 times… 
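For context, the conventional Red-Black scheme that the abstract uses as its baseline can be sketched as follows. This is a minimal, serial, pure-Python illustration of textbook red-black SOR for the 3-D Poisson equation. It is not the authors' twinned buffering scheme, which the truncated abstract does not describe. The function names, the cubic grid, the zero Dirichlet boundary and the default relaxation factor are all assumptions made for this example.

```python
def red_black_sor(rhs, h, omega=1.5, sweeps=200):
    """Approximately solve laplacian(p) = rhs on an n x n x n grid.

    Assumptions for this sketch: uniform spacing h, p = 0 on the boundary,
    7-point stencil, fixed number of sweeps instead of a convergence test.
    """
    n = len(rhs)
    p = [[[0.0] * n for _ in range(n)] for _ in range(n)]
    for _ in range(sweeps):
        # Colour 0 ("red") has i+j+k even, colour 1 ("black") has i+j+k odd.
        # Every neighbour of a point has the opposite colour, so all points
        # of one colour could be updated in parallel.
        for colour in (0, 1):
            for i in range(1, n - 1):
                for j in range(1, n - 1):
                    for k in range(1, n - 1):
                        if (i + j + k) % 2 != colour:
                            continue
                        neighbours = (p[i - 1][j][k] + p[i + 1][j][k]
                                      + p[i][j - 1][k] + p[i][j + 1][k]
                                      + p[i][j][k - 1] + p[i][j][k + 1])
                        # Gauss-Seidel value, then over-relax by omega.
                        gs = (neighbours - h * h * rhs[i][j][k]) / 6.0
                        p[i][j][k] = (1.0 - omega) * p[i][j][k] + omega * gs
    return p


def max_residual(p, rhs, h):
    """Largest absolute residual of the discrete Poisson equation."""
    n = len(p)
    worst = 0.0
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            for k in range(1, n - 1):
                lap = (p[i - 1][j][k] + p[i + 1][j][k]
                       + p[i][j - 1][k] + p[i][j + 1][k]
                       + p[i][j][k - 1] + p[i][j][k + 1]
                       - 6.0 * p[i][j][k]) / (h * h)
                worst = max(worst, abs(lap - rhs[i][j][k]))
    return worst
```

The two-colour split is what makes the method parallelizable: each half-sweep reads only points of the other colour, at the cost of touching only half of the grid per pass.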

Towards Automatic Transformation of Legacy Scientific Code into OpenCL for Optimal Performance on FPGAs
TLDR
Shows a route to the automatic creation of kernels optimised for execution in a "streaming" fashion, which gives optimal performance on FPGAs. The results show better FPGA performance than a baseline CPU implementation and better energy efficiency than both CPU and GPU implementations.
Accelerating Computational Finance Simulations with OpenCL: a case study
TLDR
This thesis investigates the computational requirements of a scenario based ALM application, which is part of a commercial product offered by Ortec-Finance, and proposes a novel OpenCL implementation, optimized for the Intel Xeon Phi co-processor.

References

SHOWING 1-10 OF 21 REFERENCES
Graphics processing unit acceleration of the red/black SOR method
TLDR
The results show that the global memory cache added on recent GPU architectures helps achieve high performance without having to use the special memory types provided by the GPU (i.e. shared, texture or constant memory).
Hybrid CPU-GPU Solver for Gradient Domain Processing of Massive Images
TLDR
This paper presents a hybrid parallel implementation of gradient domain processing for seamless stitching of gigapixel panoramas that utilizes MPI, threading and a CUDA based GPU component.
GPU optimized computation of stencil based algorithms
TLDR
The approach described in the paper represents a step forward not only for the steady-state heat conduction problem but also for any other algorithm that numerically solves partial differential equations or is stencil based.
A GPU Accelerated Red-Black SOR Algorithm for Computational Fluid Dynamics Problems
TLDR
It is concluded that using the memory hierarchy properly plays a key role in improving computational performance on GPUs.
An 80-Fold Speedup, 15.0 TFlops Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code
  • T. Shimokawabe, T. Aoki, S. Matsuoka
  • Computer Science, Environmental Science
  • 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis
  • 2010
TLDR
This work presents the first full CUDA port known to date of the high-resolution weather prediction model ASUCA, and demonstrates over 80-fold speedup and good weak scaling, achieving 15.0 TFlops in single precision.
An investigation into the feasibility and benefits of GPU/multicore acceleration of the weather research and forecasting model
  • W. Vanderbauwhede, T. Takemi
  • Computer Science, Environmental Science
  • 2013 International Conference on High Performance Computing & Simulation (HPCS)
  • 2013
TLDR
This work studied the Weather Research and Forecasting Model to assess if GPU acceleration of this type of Numerical Weather Prediction code is both feasible and worthwhile, and developed an extensible system for integrating OpenCL code into large Fortran code bases such as WRF.
An analysis of the feasibility and benefits of GPU/multicore acceleration of the Weather Research and Forecasting model
TLDR
The results of a study of the Weather Research and Forecasting (WRF) model are presented in order to assess if GPU and multicore acceleration of this type of numerical weather prediction (NWP) code is both feasible and worthwhile.
GPU acceleration of numerical weather prediction
TLDR
This paper presents an alternative method of scaling model performance by exploiting emerging architectures using the fine-grain parallelism once used in vector machines, and shows the promise of this approach by demonstrating a 20 times speedup for a computationally intensive portion of the WRF model on an NVIDIA 8800 GTX graphics processing unit (GPU).
Numerical Recipes in Fortran 90
TLDR
By default, array expressions and assignments are performed for all elements of the same-shaped arrays referenced; this can be modified by use of a where construct.
Numerical Recipes 3rd Edition: The Art of Scientific Computing
TLDR
This new edition incorporates more than 400 Numerical Recipes routines, many of them new or upgraded, and adopts an object-oriented style particularly suited to scientific applications.
...