Stateful dataflow multigraphs: a data-centric model for performance portability on heterogeneous architectures

  • Tal Ben-Nun, Johannes de Fine Licht, Alexandros Nikolaos Ziogas, Timo Schneider, Torsten Hoefler
  • Published 27 February 2019
  • Computer Science
  • Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis
The ubiquity of accelerators in high-performance computing has driven programming complexity beyond the skill-set of the average domain scientist. To maintain performance portability in the future, it is imperative to decouple architecture-specific programming paradigms from the underlying scientific computations. We present the Stateful DataFlow multiGraph (SDFG), a data-centric intermediate representation that enables separating program definition from its optimization. By combining fine… 

Productive Performance Engineering for Weather and Climate Modeling with Python

This work presents a detailed account of optimizing the Finite Volume Cubed-Sphere (FV3) weather model, improving productivity and performance by using a declarative Python-embedded stencil DSL and data-centric optimization.

Productivity, portability, performance: data-centric Python

This work presents a workflow that retains Python's high productivity while achieving portable performance across different architectures and includes HPC-oriented language extensions and a set of automatic optimizations powered by a data-centric intermediate representation.

Data Movement Is All You Need: A Case Study on Optimizing Transformers

This work finds that data movement is the key bottleneck when training transformers, and presents a recipe for globally optimizing data movement that achieves a 1.30x performance improvement over state-of-the-art frameworks when training BERT.
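
The principle behind that recipe can be sketched with a toy NumPy example (illustrative only; the paper derives such fusions systematically via dataflow analysis, not by hand):

```python
import numpy as np

# Elementwise chains (bias add, activation, scaling) in transformer training
# are memory-bound: each unfused op streams the whole tensor through memory.
x = np.random.rand(1024, 1024)
b = np.random.rand(1024)

# Unfused: two full-size intermediates are materialized.
t1 = x + b
t2 = np.maximum(t1, 0.0)      # ReLU stands in for GELU here
unfused = t2 * 0.5

# "Fused": one expression, no named intermediates (NumPy still allocates
# temporaries internally, but a fusing compiler would emit a single loop nest
# that reads and writes each element exactly once).
fused = np.maximum(x + b, 0.0) * 0.5
assert np.allclose(unfused, fused)
```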

Optimizing the data movement in quantum transport simulations via data-centric parallel programming

A global, data-centric view of a state-of-the-art quantum transport simulator is considered to optimize its execution on supercomputers and yields coarse- and fine-grained data-movement characteristics, which are used for performance and communication modeling, communication avoidance, and data-layout transformations.

StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems

This work considers the general case of mapping directed acyclic graphs of heterogeneous stencil computations to spatial computing systems, assuming large input programs without an iterative component; StencilFlow maximizes temporal locality and ensures deadlock freedom in this setting.

Survey: Exploiting Data Redundancy for Optimization of Deep Learning

This article surveys hundreds of recent papers on data redundancy, introduces a novel taxonomy to put the various techniques into a single categorization framework, and offers a comprehensive description of the main methods for exploiting data redundancy to improve multiple kinds of DNNs.

SWPy: Python Numerical Computing Library Optimization for Domestic Many-core Processors

This paper makes full use of the huge computational resources of Sunway many-core processors to accelerate common key functions in the NumPy library in parallel, forming the heterogeneous many-core Python computation library SWPy.

Boosting Performance Optimization with Interactive Data Movement Visualization

This paper proposes an approach that combines static data analysis with parameterized program simulations to analyze both global data movement and fine-grained data access and reuse behavior, and visualizes the insights in-situ on the program representation.

Deinsum: Practically I/O Optimal Multilinear Algebra

This work presents Deinsum, an automated framework for distributed multilinear algebra computations expressed in Einstein notation, based on rigorous mathematical tools to address the problem of deriving data movement-optimal distributed schedules for programs with many high-dimensional inputs.
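
The Einstein-notation programs such a framework targets can be illustrated with NumPy's `einsum`, the familiar single-node counterpart (Deinsum itself derives distributed schedules for such contractions):

```python
import numpy as np

# A multilinear contraction in Einstein notation: C[i,k] = sum_j A[i,j]*B[j,k]
# (an ordinary matrix product). Repeated index j is summed over.
A = np.arange(6.0).reshape(2, 3)
B = np.arange(12.0).reshape(3, 4)

C = np.einsum("ij,jk->ik", A, B)
assert np.allclose(C, A @ B)

# Higher-dimensional inputs work the same way, e.g. a batched contraction:
T = np.random.rand(5, 2, 3)
out = np.einsum("bij,jk->bik", T, B)  # contract j for each batch b
print(out.shape)  # (5, 2, 4)
```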

Interfacing SYCL and Python for XPU Programming

The core design and implementation details of the framework are presented, including an overview of the API, a technique to support asynchronous SYCL kernel execution from Python, and a discussion of using Python extension generator tools to build SYCL-based extensions.

A Programming Language Interface to Describe Transformations and Code Generation

It is demonstrated that the automatically generated code either performs close to or outperforms two hand-tuned GPU library kernels from NVIDIA's CUBLAS 2.2 and 3.2 libraries.

Polyhedral parallel code generation for CUDA

A novel source-to-source compiler called PPCG is presented, which introduces a multilevel tiling strategy and a code generation scheme for the parallelization and locality optimization of imperfectly nested loops, managing memory and exposing concurrency according to the constraints of modern GPUs.

OpenMP: an industry standard API for shared-memory programming

At its most elemental level, OpenMP is a set of compiler directives and callable runtime library routines that extend Fortran (and, separately, C and C++) to express shared-memory parallelism.

Generating performance portable code using rewrite rules: from high-level functional expressions to high-performance OpenCL code

This work proposes a novel approach aiming to combine high-level programming, code portability, and high-performance by applying a simple set of rewrite rules to transform it into a low-level functional representation close to the OpenCL programming model, from which OpenCL code is generated.
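
A toy sketch of one such rewrite rule (illustrative only; the actual system applies its rules to a functional IR before generating OpenCL code):

```python
# Expressions: ("map", f, xs) applies f to every element of xs.
# Rewrite rule (map fusion): map(f, map(g, xs)) -> map(f.g, xs),
# eliminating an intermediate array -- the kind of algebraic rule
# such systems apply before emitting low-level code.
def rewrite(expr):
    if expr[0] == "map" and isinstance(expr[2], tuple) and expr[2][0] == "map":
        f, (_, g, xs) = expr[1], expr[2]
        return rewrite(("map", lambda x: f(g(x)), xs))
    return expr

def evaluate(expr):
    _, f, xs = expr
    return [f(x) for x in xs]

prog = ("map", lambda x: x + 1, ("map", lambda x: x * 2, [1, 2, 3]))
fused = rewrite(prog)           # a single map over the original list
print(evaluate(fused))  # [3, 5, 7]
```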

A lightweight infrastructure for graph analytics

This paper argues that existing DSLs can be implemented on top of a general-purpose infrastructure that supports very fine-grain tasks, implements autonomous, speculative execution of these tasks, and allows application-specific control of task scheduling policies.

CHiLL : A Framework for Composing High-Level Loop Transformations

A general and robust loop transformation framework is presented that enables compilers to generate efficient code for complex loop nests; performance results on automatically generated code for the Pentium M and MIPS R10000 are comparable to the best hand-tuned codes, and significantly better than those of the native compilers.

The program dependence graph and its use in optimization

An intermediate program representation, called the program dependence graph (PDG), is presented that makes explicit both the data and control dependences for each operation in a program, allowing transformations to be triggered by one another and applied only to affected dependences.
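
A minimal sketch (hypothetical statement numbering, for intuition only) of the two dependence kinds a PDG records and how they license reordering:

```python
# Toy program:
#   1: x = read()
#   2: if x > 0:        (control-dependence source)
#   3:     y = x + 1    (data-dependent on 1, control-dependent on 2)
#   4: z = x * 2        (data-dependent on 1 only)
data_deps = {3: [1], 4: [1]}   # statement -> statements whose values it reads
control_deps = {3: [2]}        # statement -> predicates governing its execution

def may_reorder(a, b):
    """Two statements can be reordered if neither depends on the other."""
    return (b not in data_deps.get(a, []) and b not in control_deps.get(a, [])
            and a not in data_deps.get(b, []) and a not in control_deps.get(b, []))

print(may_reorder(3, 4))  # True: independent, a transformation may reorder them
print(may_reorder(1, 4))  # False: statement 4 reads the x defined at statement 1
```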

PolyBench: The Polyhedral Benchmark suite. https://

  • 2016

Tiramisu: A Code Optimization Framework for High Performance Systems

Tiramisu is introduced, an optimization framework designed to generate efficient code for high-performance systems such as multicores, GPUs, FPGAs, distributed machines, or any combination of these; it features a novel four-level IR that allows full separation between algorithms, schedules, data layouts, and communication.

Gluon: a communication-optimizing substrate for distributed heterogeneous graph analytics

This paper introduces a new approach to building distributed-memory graph analytics systems that exploits heterogeneity in processor types (CPU and GPU), partitioning policies, and programming models, together with Gluon, a communication-optimizing substrate that enables these programs to run on heterogeneous clusters and optimizes communication in a novel way.