LoopStack: a Lightweight Tensor Algebra Compiler Stack

Bram Wasti, José Pablo Cambronero, Benoit Steiner, Hugh Leather, Aleksandar Zlateski
We present LoopStack, a domain-specific compiler stack for tensor operations, composed of a frontend, LoopTool, and an efficient optimizing code generator, LoopNest. This stack enables us to compile entire neural networks and generate code targeting the AVX2, AVX512, NEON, and NEONfp16 instruction sets while incorporating optimizations often missing from other machine learning compiler backends. We evaluate our stack on a collection of full neural networks and commonly used network blocks as…



LIBXSMM: Accelerating Small Matrix Multiplications by Runtime Code Generation

This work presents a library which provides high performance small matrix multiplications targeting all recent x86 vector instruction set extensions up to Intel AVX-512, and accompanies this library with a BLAS-compliant frontend which features a multi-level code-cache hierarchy.

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

TVM is a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends and automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations.

cuDNN: Efficient Primitives for Deep Learning

This work presents a library similar in intent to BLAS, with optimized routines for deep learning workloads. The implementation contains routines for GPUs, but, like the BLAS library, it could be implemented for other platforms.

PyTorch: An Imperative Style, High-Performance Deep Learning Library

This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.

Learning to Optimize Tensor Programs

A learning-based framework to optimize tensor programs for deep learning workloads that learns domain-specific statistical cost models to guide the search of tensor operator implementations over billions of possible program variants and accelerates the search by effective model transfer across workloads.

TensorFlow: A system for large-scale machine learning

The TensorFlow dataflow model is described, and the compelling performance that TensorFlow achieves for several real-world applications is demonstrated.

Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines

This work presents a systematic model of the tradeoff space fundamental to stencil pipelines, a schedule representation that describes concrete points in this space for each stage in an image processing pipeline, and an optimizing compiler for the Halide image processing language that synthesizes high-performance implementations from a Halide algorithm and a schedule.

Polly – Polyhedral optimization in LLVM

This work presents Polly, a project to enable polyhedral optimizations in LLVM. Polly automatically detects and transforms relevant program parts in a language-independent and syntactically transparent way, and supports programs written in most common programming languages and constructs.

Anatomy of High-Performance Many-Threaded Matrix Multiplication

This work describes how BLIS extends the "GotoBLAS approach" to implementing matrix multiplication (GEMM), and shows that with the advent of many-core architectures such as the IBM PowerPC A2 processor and the Intel Xeon Phi processor, parallelizing both within and around the inner kernel, as the BLIS approach supports, is not only convenient, but also necessary for scalability.

The tensor algebra compiler

The first compiler technique to automatically generate kernels for any compound tensor algebra operation on dense and sparse tensors is introduced, which is competitive with best-in-class hand-optimized kernels in popular libraries, while supporting far more tensor operations.