Hardware Acceleration of HPC Computational Flow Dynamics using HBM-enabled FPGAs

@article{Hogervorst2021HardwareAO,
  title={Hardware Acceleration of HPC Computational Flow Dynamics using HBM-enabled FPGAs},
  author={Tom Hogervorst and Tong Dong Qiu and Giacomo Marchiori and Alf Birger Rustad and Markus Blatt and Răzvan Nane},
  journal={arXiv preprint arXiv:2101.01745},
  year={2021}
}
Scientific computing is at the core of many High-Performance Computing (HPC) applications, including computational flow dynamics. Because of the utmost importance of simulating increasingly large computational models, hardware acceleration is receiving increased attention due to its potential to maximize the performance of scientific computing. A Field-Programmable Gate Array (FPGA) is a reconfigurable hardware accelerator that is fully customizable in terms of computational resources and… 
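Most of the references below target the conjugate gradient (CG) method and its dominant kernel, sparse matrix-vector multiplication (SpMV). The following is a minimal illustrative sketch in Python of these two building blocks using the CSR storage format; it is not the paper's FPGA implementation, and all names and the example matrix are hypothetical.

```python
import numpy as np

def spmv_csr(values, col_idx, row_ptr, x):
    """Sparse matrix-vector product y = A @ x, with A stored in CSR format."""
    y = np.zeros(len(row_ptr) - 1)
    for i in range(len(y)):                       # one output element per row
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]     # gather from x: low spatial locality
    return y

def conjugate_gradient(spmv, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A, given only an SpMV kernel."""
    x = np.zeros_like(b)
    r = b - spmv(x)          # residual
    p = r.copy()             # search direction
    rs = r @ r
    for _ in range(max_iter):
        Ap = spmv(p)         # the dominant cost: one SpMV per iteration
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Hypothetical example: 3x3 SPD tridiagonal matrix [[4,-1,0],[-1,4,-1],[0,-1,4]] in CSR form.
values  = np.array([4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0])
col_idx = np.array([0, 1, 0, 1, 2, 1, 2])
row_ptr = np.array([0, 2, 5, 7])
b = np.array([1.0, 2.0, 3.0])
x = conjugate_gradient(lambda v: spmv_csr(values, col_idx, row_ptr, v), b)
```

The indirect access `x[col_idx[k]]` in the inner loop is the irregular memory pattern that the referenced FPGA and GPU works restructure via custom encodings, partitioning, and high memory bandwidth.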

References

Showing 1–10 of 28 references
A Hybrid Approach for Mapping Conjugate Gradient onto an FPGA-Augmented Reconfigurable Supercomputer
A parameterized, parallelized, deeply pipelined, dual-FPGA, IEEE-754 64-bit floating-point design for accelerating the conjugate gradient (CG) iterative method on an FPGA-augmented RC that can achieve a 4-fold speedup on a next-generation RC.
A High Memory Bandwidth FPGA Accelerator for Sparse Matrix-Vector Multiplication
This paper introduces an FPGA-optimized SMVM architecture and a novel sparse matrix encoding that explicitly exposes parallelism across rows, while keeping the hardware complexity and on-chip memory usage low.
Sparse Matrix-Vector multiplication on FPGAs
Besides solving SpMXV problem, the design provides a parameterized and flexible tree-based design for floating-point applications on FPGAs, which demonstrates significant speedup over general-purpose processors particularly for matrices with very irregular sparsity structure.
A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-blas on FPGAs
This paper describes an FPGA-based SpMxV kernel that is scalable to efficiently utilize the available memory bandwidth and computing resources and is able to achieve higher performance than its CPU and GPU counterparts, while using only 64 single-precision processing elements.
High-Performance Architecture for the Conjugate Gradient Solver on FPGAs
This brief proposes a high-performance architecture for the CG solver on FPGAs, which can handle sparse linear systems with arbitrary size and sparsity pattern and does not need aggressive zero padding.
Sparstition: A Partitioning Scheme for Large-Scale Sparse Matrix Vector Multiplication on FPGA
Sparse Matrix-Vector Multiplication is a key kernel in various domains that is known to be difficult to parallelize efficiently due to the low spatial locality of data; Sparstition, a novel partitioning scheme that enables computing SpMV without the need for any major post-processing steps, is presented.
A Survey and Evaluation of FPGA High-Level Synthesis Tools
  • R. Nane, V. Sima, +9 authors K. Bertels
  • Engineering, Computer Science
    IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
  • 2016
This work uses a first-published methodology to compare one commercial and three academic tools on a common set of C benchmarks, aiming at performing an in-depth evaluation in terms of performance and the use of resources.
Automatically Tuning Sparse Matrix-Vector Multiplication for GPU Architectures
A new storage format for sparse matrices is presented that better employs locality, has low memory footprint and enables automatic specialization for various matrices and future devices via parameter tuning.
Optimization of Block Sparse Matrix-Vector Multiplication on Shared-Memory Parallel Architectures
  • Ryan Eberhardt, M. Hoemmen
  • Computer Science
    2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
  • 2016
This paper gives a set of algorithms that performs SpMV up to 4x faster than the NVIDIA cuSPARSE cusparseDbsrmv routine, up to 147x faster than the Intel Math Kernel Library (MKL) mkl_dbsrmv routine (a single-threaded BCSR SpMV kernel), and up to 3x faster than the MKL mkl_dcsrmv routine (a multi-threaded CSR SpMV kernel).
An efficient sparse conjugate gradient solver using a Beneš permutation network
A heuristics for offline scheduling is described, the effect of which is captured in a parametric model for estimating the performance of designs generated from the approach.