Pure tensor program rewriting via access patterns (representation pearl)

@article{Smith2021PureTP,
  title={Pure tensor program rewriting via access patterns (representation pearl)},
  author={Gus Henry Smith and Andrew Liu and Steven Lyubomirsky and Scott Davidson and Joseph McMahan and Michael B. Taylor and Luis Ceze and Zachary Tatlock},
  journal={Proceedings of the 5th ACM SIGPLAN International Symposium on Machine Programming},
  year={2021}
}
Tensor kernels in machine learning (ML) often correspond to pure mathematical expressions, making term rewriting an attractive strategy for optimization and mapping to specialized hardware accelerators. However, existing ML intermediate representations (IRs) tend to either be pure but high-level, making low-level rewrites to hardware targets inexpressible, or low-level but impure, hampering the use of term rewriting altogether. This paper introduces Glenside, a pure IR whose core abstraction… 

Verified tensor-program optimization via high-level scheduling rewrites

A lightweight Coq framework for optimizing tensor kernels written in a pure, functional array language capable of deriving the optimizations of existing state-of-the-art languages like Halide and generating comparably performant code is presented.

Composable and Modular Code Generation in MLIR: A Structured and Retargetable Approach to Tensor Compiler Construction

A structured approach to the construction of domain-specific code generators for tensor compilers, with the stated goal of improving the productivity of both compiler engineers and end-users.

Sketch-Guided Equality Saturation: Scaling Equality Saturation to Complex Optimizations of Functional Programs

Sketch-guided equality saturation is introduced, a semi-automatic technique that allows programmers to provide program sketches to guide rewriting and is evaluated for seven complex matrix multiplication optimizations, including loop blocking, vectorization, and multi-threading.

Combining E-Graphs with Abstract Interpretation

This work demonstrates that abstract interpretation and e-graph analysis naturally reinforce each other through a tight integration, develops the theory behind this intuition, and presents an exemplar interval arithmetic implementation, which is applied to the FPBench suite of benchmarks.

Specialized Accelerators and Compiler Flows: Replacing Accelerator APIs with a Formal Software/Hardware Interface

This paper proposes a compiler flow termed D2A using the ILA, presents a prototype that demonstrates this flow for deep learning (DL) applications, and demonstrates checking the correctness of the resulting code through both formal verification of individual matched operations and fully automated simulation-based validation of complete applications.

Automatic Datapath Optimization using E-Graphs

It is demonstrated that modern rewriting frameworks can adequately capture a wide variety of complex optimizations performed by human designers on bit-vector manipulating code, including significant error-prone subtleties regarding the validity of transformations under complex interactions of bitwidths.

Sketch-Guided Equality Saturation: Scaling Equality Saturation to Complex Optimizations in Languages with Bindings

This paper demonstrates how to drastically improve the efficiency of equality saturation for a functional language based on the typed lambda calculus and introduces sketch-guided equality saturation, a semi-automatic technique that allows programmers to provide sketches guiding rewriting when performing complex optimizations.

Optimizing data reshaping operations in functional IRs for high-level synthesis

This paper presents an approach with rewrite rules to solve this fundamental issue and produce efficient FPGA designs from functional IRs and shows that without them, low performance designs are produced, or even worse, it is impossible to synthesize the designs at all.

Caviar: an e-graph based TRS for automatic code optimization

Caviar is presented, an e-graph-based TRS for proving expressions within compilers that can prove expressions much faster than base e-graph TRSs and is evaluated on Halide, an optimizing compiler that relies on a greedy-algorithm-based TRS to simplify and prove its expressions.

IMpress: Large Integer Multiplication Expression Rewriting for FPGA HLS

This work regards determining the level and order of multiplication decomposition as a phase ordering problem, which is a notable problem in compiler optimization, and develops a framework, IMpress, to automatically produce a wide range of equivalent integer multiplication expressions corresponding to various hardware implementations.

References

Learning to Optimize Tensor Programs

A learning-based framework to optimize tensor programs for deep learning workloads that learns domain-specific statistical cost models to guide the search of tensor operator implementations over billions of possible program variants and accelerates the search by effective model transfer across workloads.

SPORES: Sum-Product Optimization via Relational Equality Saturation for Large Scale Linear Algebra

This work introduces a general optimization technique for LA expressions, by converting the LA expressions into Relational Algebra (RA) expressions, optimizing the latter, then converting the result back to (optimized) LA expressions.

The tensor algebra compiler

The first compiler technique to automatically generate kernels for any compound tensor algebra operation on dense and sparse tensors is introduced, which is competitive with best-in-class hand-optimized kernels in popular libraries, while supporting far more tensor operations.

A High-Performance Sparse Tensor Algebra Compiler in Multi-Level IR

The results show that the performance of automatically generated kernels outperforms the state-of-the-art sparse tensor algebra compiler, with up to 20.92x, 6.39x, and 13.9x performance improvement, for parallel SpMV, SpMM, and TTM over TACO, respectively.

Verifying and improving Halide’s term rewriting system with program synthesis

This work builds an automatic program synthesis system in order to craft new, provably correct rules from failure cases where the compiler was unable to prove properties, and demonstrates that the synthesizer can produce better rules than hand-authored ones in five bug fixes.

Ansor: Generating High-Performance Tensor Programs for Deep Learning

Ansor is presented, a tensor program generation framework for deep learning applications that can find high-performance programs that are outside the search space of existing state-of-the-art approaches.

Achieving high-performance the functional way: a functional pearl on expressing high-performance optimizations as rewrite strategies

This functional pearl presents two functional languages that work together, each addressing a separate concern, and shows how the holistic functional approach achieves competitive performance with the state-of-the-art imperative systems Halide and TVM.

Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions

A language close to the mathematics of deep learning called Tensor Comprehensions offering both imperative and declarative styles, a polyhedral Just-In-Time compiler to convert a mathematical description of a deep learning DAG into a CUDA kernel with delegated memory management and synchronization, and a compilation cache populated by an autotuner are contributed.

PyTorch: An Imperative Style, High-Performance Deep Learning Library

This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.

Relay: A High-Level IR for Deep Learning

The functional, statically-typed Relay IR unifies and generalizes existing DL IRs, can express state-of-the-art models, and can eliminate abstraction overhead and target new hardware platforms.