StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems

@article{Licht2021StencilFlowML,
  title={StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems},
  author={Johannes de Fine Licht and Andreas Kuster and Tiziano De Matteis and Tal Ben-Nun and Dominic Hofer and Torsten Hoefler},
  journal={2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO)},
  year={2021},
  pages={315--326}
}

Spatial computing devices have been shown to significantly accelerate stencil computations, but have so far relied on unrolling the iterative dimension of a single stencil operation to increase temporal locality. This work considers the general case of mapping directed acyclic graphs of heterogeneous stencil computations to spatial computing systems, assuming large input programs without an iterative component. StencilFlow maximizes temporal locality and ensures deadlock freedom in this setting… 

Transformations of High-Level Synthesis Codes for High-Performance Computing

TLDR
A collection of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications, is presented, aiming to establish a common toolbox to guide both performance engineers and compiler engineers in tapping into the performance potential offered by spatial computing architectures using HLS.

Productivity, portability, performance: data-centric Python

TLDR
This work presents a workflow that retains Python's high productivity while achieving portable performance across different architectures and includes HPC-oriented language extensions and a set of automatic optimizations powered by a data-centric intermediate representation.

FPGA Acceleration of Structured-Mesh-Based Explicit and Implicit Numerical Solvers using SYCL

TLDR
A performance model predicts the runtime of designs with high accuracy (less than 5% error for all cases tested), demonstrating significant utility for design space exploration and providing insights into the feasibility and profitability of FPGA implementation.

High throughput multidimensional tridiagonal system solvers on FPGAs

TLDR
A high performance tridiagonal solver library for Xilinx FPGAs optimized for multiple multi-dimensional systems common in real-world applications is presented, achieving an order of magnitude better performance when solving large batches of systems than previous FPGA work.

Lifting C semantics for dataflow optimization

TLDR
This work proposes to lift C programs to a parametric dataflow representation that lends itself to static data-centric analysis, enables automatic high-performance code generation, and can identify parallelism when no other compiler can.

A data-centric optimization framework for machine learning

TLDR
This work empowers deep learning researchers by defining a flexible and user-customizable pipeline for optimizing training of arbitrary deep neural networks, based on data movement minimization, with competitive performance or speedups on ten different networks.

Accelerating Weather Prediction using Near-Memory Reconfigurable Fabric

TLDR
NERO, an FPGA+HBM-based accelerator connected through OCAPI (Open Coherent Accelerator Processor Interface) to an IBM POWER9 host system, is developed, and it is concluded that employing near-memory acceleration solutions for weather prediction modeling is promising as a means to achieve both high performance and high energy efficiency.

FLOWER: A comprehensive dataflow compiler for high-level synthesis

TLDR
This work presents FLOWER, a comprehensive compiler infrastructure that provides automatic canonical transformations for high-level synthesis from a domain-specific library that allows programmers to focus on algorithm implementations rather than low-level optimizations for dataflow architectures.

The digital revolution of Earth-system science

TLDR
The present limitations in the field are discussed and the design of a novel infrastructure that is scalable and more adaptable to future, yet unknown computing architectures is proposed.

High-Level FPGA Accelerator Design for Structured-Mesh-Based Explicit Numerical Solvers

This paper presents a workflow for synthesizing near-optimal FPGA implementations of structured-mesh based stencil applications for explicit solvers. It leverages key characteristics of the

References


SODA: Stencil with Optimized Dataflow Architecture

TLDR
SODA, an automated framework for implementing Stencil algorithms with Optimized Dataflow Architecture on FPGAs, minimizes the on-chip reuse buffer size required by full data reuse and provides flexible and scalable fine-grained parallelism.

MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures

TLDR
MODESTO, a model-driven stencil optimization framework, is introduced; for a given stencil program, it suggests program transformations optimized for a target architecture, and it is shown how to automatically tune stencil programs.

AN5D: automated stencil framework for high-degree temporal blocking on GPUs

TLDR
AN5D is proposed, an automated stencil framework capable of automatically transforming and optimizing stencil patterns in a given C source code and generating corresponding CUDA code; parameter tuning in the framework is guided by the performance model.

High performance stencil code generation with Lift

TLDR
This paper demonstrates how complex multidimensional stencil code and optimizations such as tiling are expressible using compositions of simple 1D Lift primitives, and shows that this approach outperforms existing compiler approaches and hand-tuned codes.

Multi-FPGA Accelerator for Scalable Stencil Computation with Constant Memory Bandwidth

TLDR
This paper designs a custom computing machine (CCM) called a scalable streaming-array (SSA), for high-performance stencil computations with multiple field-programmable gate arrays (FPGAs) based on a domain-specific programmable concept.

NARMADA: Near-Memory Horizontal Diffusion Accelerator for Scalable Stencil Computations

TLDR
This work offloads a horizontal diffusion kernel, which is a compound stencil kernel, from the COSMO weather prediction application to a reconfigurable fabric, and introduces a memory hierarchy tailored to the targeted application and using a coherent memory model, which improves memory efficiency.

Fast Stencil-Code Computation on a Wafer-Scale Processor

TLDR
Large, sparse, and often structured systems of linear equations are solved on the Cerebras Systems CS-1, a wafer-scale processor that has the memory bandwidth and communication latency to perform well.

Combined Spatial and Temporal Blocking for High-Performance Stencil Computation on FPGAs Using OpenCL

TLDR
This work creates a stencil accelerator using Intel FPGA SDK for OpenCL that achieves high performance while avoiding input size restrictions by combining spatial and temporal blocking, and employs multiple FPGA-specific optimizations to tackle issues arising from the added design complexity.

Stateful dataflow multigraphs: a data-centric model for performance portability on heterogeneous architectures

TLDR
The Stateful DataFlow multiGraph (SDFG), a data-centric intermediate representation that enables separating program definition from its optimization, is presented, allowing domain scientists to develop applications naturally and port them to approach peak hardware performance without modifying the original scientific code.

Transformations of High-Level Synthesis Codes for High-Performance Computing

TLDR
A collection of optimizing transformations for HLS, targeting scalable and efficient architectures for high-performance computing (HPC) applications, is presented, aiming to establish a common toolbox to guide both performance engineers and compiler engineers in tapping into the performance potential offered by spatial computing architectures using HLS.