Optimization of lattice Boltzmann simulations on heterogeneous computers

@article{Calore2019OptimizationOL,
  title={Optimization of lattice Boltzmann simulations on heterogeneous computers},
  author={E. Calore and A. Gabbana and S. Schifano and R. Tripiccione},
  journal={The International Journal of High Performance Computing Applications},
  year={2019},
  volume={33},
  pages={124--139}
}
High-performance computing systems are increasingly based on accelerators. Computing applications targeting those systems often follow a host-driven approach, in which hosts offload almost all compute-intensive sections of the code onto accelerators; this approach only marginally exploits the computational resources available on the host CPUs, limiting overall performance. The obvious step forward is to run compute-intensive kernels in a concurrent and balanced way on both hosts and…
Design and Optimizations of Lattice Boltzmann Methods for Massively Parallel GPU-Based Clusters
GPUs deliver higher performance than traditional processors, offering remarkable energy efficiency, and are quickly becoming very popular processors for HPC applications. Still, writing efficient and…
A Holistic Scalable Implementation Approach of the Lattice Boltzmann Method for CPU/GPU Heterogeneous Clusters
This work utilizes the lattice Boltzmann method for fluid flow as a representative of a scientific computing application and develops a holistic implementation for large-scale CPU/GPU heterogeneous clusters, showing excellent scalability behavior that makes it future-proof for heterogeneous clusters of upcoming architectures on the exaFLOPS scale.
Early Experience on Using Knights Landing Processors for Lattice Boltzmann Applications
In the OpenMP code this work considers several memory data-layouts that meet the conflicting computing requirements of distinct parts of the application, and sustain a large fraction of peak performance, and makes some performance comparisons with other processors and accelerators.
Performance and Energy Assessment of a Lattice Boltzmann Method Based Application on the Skylake Processor
This paper presents the performance analysis for both the computing performance and the energy efficiency of a Lattice Boltzmann Method (LBM) based application, used to simulate three-dimensional…
Performance and energy consumption of HPC workloads on a cluster based on Arm ThunderX2 CPU
The results show that ThunderX2 delivers similar or better energy-to-solution and scalability, proving that Arm-based chips are legitimate contenders in the market of next-generation HPC systems.
High‐performance SIMD implementation of the lattice‐Boltzmann method on the Xeon Phi processor
A high‐performance implementation of the lattice‐Boltzmann method (LBM) on the Knights Landing generation of Xeon Phi, which exceeds 960 million fluid lattice site updates per second in benchmark simulations of fluid flow through porous media.
Analysis of GPU Data Access Patterns on Complex Geometries for the D3Q19 Lattice Boltzmann Algorithm
Strong evidence is found that semi-direct addressing is often better suited than the more common indirect addressing, providing increased computational speed and reducing memory consumption, and the first near-optimal strong results for LBM with arterial geometries run on GPUs are presented.
Software and DVFS Tuning for Performance and Energy-Efficiency on Intel KNL Processors
This work focuses on the computing and energy performance of the Knights Landing Xeon Phi, the latest Intel many-core architecture processor for HPC applications, and assesses the dependence of energy consumption on data-layouts, memory configurations (DDR4 or MCDRAM), and the number of threads per core.
Advanced Performance Analysis of HPC Workloads on Cavium ThunderX
It is demonstrated that performance analysis tools available on standard HPC platforms, independently of the CPU providers, are nowadays also available for Arm SoCs, and an HPC application is actually optimized for these platforms, showing similarities with other architectures.
Unstructured Computations on Emerging Architectures
This dissertation describes detailed performance engineering and optimization of an unstructured computational aerodynamics software system with irregular memory accesses on various multi- and many-core emerging high performance computing scalable architectures, which are expected to be the building blocks of energy-austere exascale systems.

References

SHOWING 1-10 OF 44 REFERENCES
Performance and portability of accelerated lattice Boltzmann applications with OpenACC
This paper describes the multi‐node implementation and optimization of the Boltzmann algorithm, using OpenACC and MPI, and assesses the performance impact associated with portable programming, and the actual portability and performance‐portability of OpenACC‐based applications across several state‐of‐the‐art architectures.
Optimizing communications in multi-GPU Lattice Boltzmann simulations
This paper looks at the interplay between data organization and data layout, data-communication options and overlapping of communication and computation in Lattice Boltzmann Methods, considering as a use case a state-of-the-art two-dimensional LB model that accurately reproduces the thermo-hydrodynamics of a 2D-fluid obeying the equation-of-state of a perfect gas.
Benchmarking GPUs with a Parallel Lattice-Boltzmann Code
This paper considers a state-of-the-art two-dimensional LB model based on 37 populations, that accurately reproduces the thermo-hydrodynamics of a 2D-fluid obeying the equation-of-state of a perfect gas, and breaks the 1 double-precision Tflops barrier on a single-host system with two GPUs.
Performance issues on many-core processors: A D2Q37 Lattice Boltzmann scheme as a test-case
This paper presents the implementation of the Lattice Boltzmann code on the Sandy Bridge processor, and assesses the efficiency of several programming strategies and data-structure organizations, both in terms of memory access and computing performance.
Experience on Vectorizing Lattice Boltzmann Kernels for Multi- and Many-Core Architectures
This work considers a state-of-the-art two-dimensional LB model that accurately reproduces the thermo-hydrodynamics of a 2D-fluid, and writes a single code that runs efficiently on traditional multi-core processors as well as on recent many-core systems such as the Xeon-Phi.
Massively parallel lattice-Boltzmann codes on large GPU clusters
A massively parallel code for a state-of-the-art thermal lattice–Boltzmann method able to deliver a sustained performance of several tens of Tflops as well as a design and optimization methodology that can be used for the development of other high performance applications for computational physics.
Early Experience on Porting and Running a Lattice Boltzmann Code on the Xeon-Phi Co-Processor
The D2Q37 LB algorithm considered in this paper is an appropriate test-bed for this architecture since the critical computing kernels require high performance both in terms of memory bandwidth for sparse memory access patterns and number crunching capability.
Accelerating fluid-solid simulations (Lattice-Boltzmann & Immersed-Boundary) on heterogeneous architectures
We propose a numerical approach based on the Lattice-Boltzmann (LBM) and Immersed Boundary (IB) methods to tackle the problem of the interaction of solids with an incompressible fluid flow, and its…
Performance Evaluation of Parallel Large-Scale Lattice Boltzmann Applications on Three Supercomputing Architectures
It will be shown that a high speed communication network in combination with an efficient CPU is mandatory in order to achieve the required performance.
Benchmarking MIC architectures with Monte Carlo simulations of spin glass systems
This paper ports and optimizes for many-core processors a Monte Carlo code for the simulation of the 3D Edwards Anderson spin glass, focusing on a dual eight-core Sandy Bridge processor, and on a Xeon-Phi co-processor based on the new Many Integrated Core architecture.