Massively parallel lattice-Boltzmann codes on large GPU clusters

@article{Calore2016MassivelyPL,
  title={Massively parallel lattice-Boltzmann codes on large GPU clusters},
  author={E. Calore and A. Gabbana and J. Kraus and E. Pellegrini and S. Schifano and R. Tripiccione},
  journal={Parallel Comput.},
  year={2016},
  volume={58},
  pages={1--24}
}
Abstract This paper describes a massively parallel code for a state-of-the-art thermal lattice–Boltzmann method. Our code has been carefully optimized for performance on one GPU and for good scaling behavior across a large number of GPUs. Versions of this code have already been used for large-scale studies of convective turbulence. GPUs are becoming increasingly popular in HPC applications, as they are able to deliver higher performance than traditional processors. Writing efficient…
Optimization of lattice Boltzmann simulations on heterogeneous computers
Common data layouts are defined, enabling the code to exploit the different parallel and vector options of the various accelerators efficiently while matching the possibly different requirements of the compute-bound and memory-bound kernels of the application.
GPU Acceleration of the HemeLB Code for Lattice Boltzmann Simulations in Sparse Complex Geometries
We present an implementation and scaling analysis of a GPU-accelerated kernel for HemeLB, a high-performance Lattice Boltzmann code for sparse complex geometries. We describe the structure of the GPU…
Portable multi-node LQCD Monte Carlo simulations using OpenACC
This paper describes a state-of-the-art parallel Lattice QCD Monte Carlo code for staggered fermions, purposely designed to be portable across different computer architectures, including GPUs and…
A Holistic Scalable Implementation Approach of the Lattice Boltzmann Method for CPU/GPU Heterogeneous Clusters
This work uses the lattice Boltzmann method for fluid flow as a representative scientific computing application and develops a holistic implementation for large-scale CPU/GPU heterogeneous clusters, showing excellent scalability that makes it future-proof for heterogeneous clusters of upcoming architectures at the exaFLOPS scale.
Evaluation of a Directive-Based GPU Programming Approach for High-Order Unstructured Mesh Computational Fluid Dynamics
This work finds that sparse matrix-vector multiplication with OpenCL is faster than using OpenACC with cuBLAS, and that the directive-based approach offered by OpenACC results in a flexible, unified and hence smaller code base that is easier to maintain, is readily portable and promotes algorithm development.
Scalable GPU Communication with Code Generation on Stencil Applications
An improvement to the CUDA-based communication of stencil applications in the waLBerla framework is presented, achieving scalability while supporting different GPUs and communication infrastructures; packing is shown to achieve almost linear weak scaling on the Santos Dumont supercomputer with up to 128 GPUs.
Hybrid MPI and CUDA Parallelization for CFD Applications on Multi-GPU HPC Clusters
A hybrid parallel algorithm combining the Message Passing Interface (MPI) and CUDA is proposed for CFD applications on multi-GPU HPC clusters, with a one-dimensional domain decomposition used to balance the workload among GPUs.
A Dynamic Task-Based D3Q19 Lattice-Boltzmann Method for Heterogeneous Architectures
This paper presents a dynamic task-based D3Q19 LBM implementation using three runtime systems for heterogeneous architectures: OmpSs, StarPU, and XKaapi, and details the implementations and compares performance over two heterogeneous platforms.
Physically based visual simulation of the Lattice Boltzmann method on the GPU: a survey
An up-to-date survey on the research regarding the LBM for fluid simulation using GPUs is given, discussing how the method was implemented with different GPU architectures and software frameworks, focusing on optimization techniques and their performance.
Early Experience on Using Knights Landing Processors for Lattice Boltzmann Applications
In the OpenMP code, this work considers several memory data layouts that meet the conflicting computing requirements of distinct parts of the application and sustain a large fraction of peak performance, and makes some performance comparisons with other processors and accelerators.

References

Benchmarking GPUs with a Parallel Lattice-Boltzmann Code
This paper considers a state-of-the-art two-dimensional LB model based on 37 populations that accurately reproduces the thermo-hydrodynamics of a 2D fluid obeying the equation of state of a perfect gas, and breaks the 1 double-precision Tflops barrier on a single-host system with two GPUs.
On Portability, Performance and Scalability of an MPI OpenCL Lattice Boltzmann Code
A performance assessment of a massively parallel and portable Lattice Boltzmann code, based on the Open Computing Language (OpenCL) and the Message Passing Interface (MPI), is presented, together with techniques to move data between accelerators while minimizing the overhead of communication latencies.
Early Experience on Porting and Running a Lattice Boltzmann Code on the Xeon-Phi Co-Processor
The D2Q37 LB algorithm considered in this paper is an appropriate test-bed for this architecture, since the critical computing kernels require high performance both in terms of memory bandwidth for sparse memory access patterns and in terms of number-crunching capability.
Performance and portability of accelerated lattice Boltzmann applications with OpenACC
This paper describes the multi-node implementation and optimization of the Boltzmann algorithm, using OpenACC and MPI, and assesses the performance impact associated with portable programming, and the actual portability and performance-portability of OpenACC-based applications across several state-of-the-art architectures.
An Optimized Lattice Boltzmann Code for BlueGene/Q
This paper considers a state-of-the-art LB code that accurately reproduces the thermo-hydrodynamics of a 2D fluid obeying the equation of state of a perfect gas, and describes the implementation strategies, based on previous experience with clusters of many-core processors and GPUs.
A Portable OpenCL Lattice Boltzmann Code for Multi- and Many-core Processor Architectures
This work shows that a properly structured OpenCL code runs on many different systems reaching performance levels close to those obtained by architecture-tuned CUDA or C codes.
Exploiting parallelism in many-core architectures: Lattice Boltzmann models as a test case
A state-of-the-art Lattice Boltzmann model that accurately reproduces the thermo-hydrodynamics of a 2D fluid obeying the equation of state of a perfect gas is considered as a test-bed; it is a production-ready code already in use for large-scale simulations of the Rayleigh-Taylor instability.
Performance issues on many-core processors: A D2Q37 Lattice Boltzmann scheme as a test-case
This paper presents the implementation of the Lattice Boltzmann code on the Sandy Bridge processor, and assesses the efficiency of several programming strategies and data-structure organizations, both in terms of memory access and computing performance.
Accelerating Lattice Boltzmann Applications with OpenACC
This paper implements and optimizes a massively parallel Lattice Boltzmann code using OpenACC and OpenMPI, and compares performance with that of the same algorithm written in CUDA, OpenCL and C for GPUs, Xeon-Phi and traditional multi-core CPUs.
Optimization And Profiling Of The Cache Performance Of Parallel Lattice Boltzmann Codes
When designing and implementing highly efficient scientific applications for parallel computers such as clusters of workstations, it is inevitable to consider and to optimize the single-CPU…