Lifting C semantics for dataflow optimization

@inproceedings{Calotoiu2022LiftingCS,
  title={Lifting C semantics for dataflow optimization},
  author={Alexandru Calotoiu and Tal Ben-Nun and Grzegorz Kwasniewski and Johannes de Fine Licht and Timo Schneider and Philipp Schaad and Torsten Hoefler},
  booktitle={Proceedings of the 36th ACM International Conference on Supercomputing},
  year={2022}
}
C is the lingua franca of programming and almost any device can be programmed using C. However, programming modern heterogeneous architectures such as multi-core CPUs and GPUs requires explicitly expressing parallelism as well as device-specific properties such as memory hierarchies. The resulting code is often hard to understand, debug, and modify for different architectures. We propose to lift C programs to a parametric dataflow representation that lends itself to static data-centric analysis… 

Boosting Performance Optimization with Interactive Data Movement Visualization

This paper proposes an approach that combines static data analysis with parameterized program simulations to analyze both global data movement and fine-grained data access and reuse behavior, and visualize insights in-situ on the program representation.

Productive Performance Engineering for Weather and Climate Modeling with Python

This work presents a detailed account of optimizing the Finite Volume Cubed-Sphere (FV3) weather model, improving productivity and performance by using a declarative Python-embedded stencil DSL and data-centric optimization.

References

SHOWING 1-10 OF 59 REFERENCES

Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model

This work proposes an automatic transformation framework to optimize arbitrarily-nested loop sequences with affine dependences for parallelism and locality simultaneously and finds good tiling hyperplanes by embedding a powerful and versatile cost function into an Integer Linear Programming formulation.

PolyBench: The Polyhedral Benchmark suite

  • 2016

Polly - Performing Polyhedral Optimizations on a Low-Level Intermediate Representation

Polly is presented, an infrastructure for polyhedral optimizations on the compiler's internal, low-level, intermediate representation (IR) and an interface for connecting external optimizers and a novel way of using the parallelism they introduce to generate SIMD and OpenMP code is presented.

OpenMP: an industry standard API for shared-memory programming

At its most elemental level, OpenMP is a set of compiler directives and callable runtime library routines that extend Fortran (and separately, C and C++) to express shared-memory parallelism.

MAPS: Optimizing Massively Parallel Applications Using Device-Level Memory Abstraction

  • 2014

StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems

The general case of mapping directed acyclic graphs of heterogeneous stencil computations to spatial computing systems, assuming large input programs without an iterative component is considered, and StencilFlow maximizes temporal locality and ensures deadlock freedom in this setting.

Data Movement Is All You Need: A Case Study on Optimizing Transformers

This work finds that data movement is the key bottleneck when training, and presents a recipe for globally optimizing data movement in transformers to achieve a 1.30x performance improvement over state-of-the-art frameworks when training BERT.

MLIR: A Compiler Infrastructure for the End of Moore's Law

Evaluation of MLIR as a generalized infrastructure that reduces the cost of building compilers-describing diverse use-cases to show research and educational opportunities for future programming languages, compilers, execution environments, and computer architecture.

A data-centric approach to extreme-scale ab initio dissipative quantum transport simulations

The restructured QT simulator is able to treat realistic nanoelectronic devices made of more than 10,000 atoms within a 14x shorter duration than the original code needs to handle a system with 1,000 atoms, on the same number of CPUs/GPUs and with the same physical accuracy.
...