Memory Safe Computations with XLA Compiler

Artem Artemev, Tilman Roeder, Mark van der Wilk
Software packages like TensorFlow and PyTorch are designed to support linear algebra operations, and their speed and usability determine their success. However, by prioritising speed, they often neglect memory requirements. As a consequence, implementations of memory-intensive algorithms that are convenient in terms of software design often cannot be run for large problems due to memory overflows. Memory-efficient solutions require complex programming approaches with significant logic outside… 
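To make the memory problem concrete, here is a minimal sketch (not from the paper) of why a naively materialised kernel matrix overflows, and how processing it in row blocks keeps memory bounded. The sizes, the squared-exponential kernel, and the helper `kernel_matvec_chunked` are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# A squared-exponential kernel matrix between n points must be materialised
# as an n x n array of float64 values if built naively.
n = 100_000
bytes_needed = n * n * 8               # 8 bytes per float64 entry
print(f"{bytes_needed / 1e9:.0f} GB")  # 80 GB -- far beyond typical accelerator memory

# The same kernel-vector product computed in row blocks only ever holds
# a (block x n) slice of the kernel matrix at a time.
def kernel_matvec_chunked(x, v, block=1_000):
    out = np.empty(len(x))
    for start in range(0, len(x), block):
        rows = x[start:start + block]
        # (block, n) pairwise squared distances for this slice only
        d2 = ((rows[:, None, :] - x[None, :, :]) ** 2).sum(-1)
        out[start:start + block] = np.exp(-d2) @ v
    return out
```

The chunked variant trades one large allocation for many small ones; the paper's point is that a compiler such as XLA can perform this kind of rewriting automatically instead of forcing it into user code.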

Kernel Operations on the GPU, with Autodiff, without Memory Overflows

The KeOps library provides fast and memory-efficient GPU support for tensors whose entries are given by a mathematical formula, such as kernel and distance matrices, outperforming existing solutions such as PyTorch CUDA tensors or the Halide and TVM libraries.

PyTorch: An Imperative Style, High-Performance Deep Learning Library

This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.

Linnea: Automatic Generation of Efficient Linear Algebra Programs

Linnea is a code generator for linear algebra problems that uses a custom best-first search algorithm to find a first solution in less than a second, and increasingly better solutions when given more time.

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

TVM is a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends and automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations.

Kernel methods through the roof: handling billions of points efficiently

This work designed a preconditioned gradient solver for kernel methods exploiting both GPU acceleration and parallelization with multiple GPUs, implementing out-of-core variants of common linear algebra operations to guarantee optimal hardware utilization.
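The out-of-core idea mentioned above can be sketched in a few lines: accumulate a Gram matrix over row blocks so that only one block (which could be streamed from disk or a remote device) is resident at a time. The function name `gram_blocked` and the block size are illustrative assumptions, not the cited library's API.

```python
import numpy as np

def gram_blocked(x, block=4):
    """Accumulate G = X.T @ X over row blocks, in the style of out-of-core
    solvers: peak memory is one (block x d) slice plus the (d x d) result,
    regardless of how many rows X has."""
    n, d = x.shape
    g = np.zeros((d, d))
    for start in range(0, n, block):
        xb = x[start:start + block]
        g += xb.T @ xb
    return g
```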

Julia: A Fresh Approach to Numerical Computing

The Julia programming language and its design are introduced: a dance between specialization and abstraction that recognizes what remains the same after computation and what is best left untouched because it has been built by experts.

TensorFlow: A system for large-scale machine learning

The TensorFlow dataflow model is described and the compelling performance that TensorFlow achieves for several real-world applications is demonstrated.

Mesh-TensorFlow: Deep Learning for Supercomputers

Mesh-TensorFlow is introduced, a language for specifying a general class of distributed tensor computations, and used to implement an efficient data-parallel, model-parallel version of the Transformer sequence-to-sequence model, surpassing state-of-the-art results on the WMT'14 English-to-French translation task and the one-billion-word language modeling benchmark.

The generalized matrix chain algorithm

A generalized version of the matrix chain algorithm is presented to generate efficient code for linear algebra problems, a task for which human experts often invest days or even weeks of work.
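The classic (ungeneralized) matrix chain problem that this line of work extends can be solved with a short dynamic program. This is a textbook sketch, not the paper's generalized algorithm:

```python
def matrix_chain_order(dims):
    """Minimal scalar multiplications to evaluate a product of matrices,
    where matrix i has shape dims[i] x dims[i+1].  Classic O(n^3) DP:
    cost[i][j] is the cheapest way to multiply matrices i..j."""
    n = len(dims) - 1
    cost = [[0] * n for _ in range(n)]
    for length in range(2, n + 1):          # chain length
        for i in range(n - length + 1):
            j = i + length - 1
            cost[i][j] = min(
                cost[i][k] + cost[k + 1][j] + dims[i] * dims[k + 1] * dims[j + 1]
                for k in range(i, j)        # split point
            )
    return cost[0][n - 1]
```

For example, for shapes 10x30, 30x5, 5x60, parenthesising as (AB)C costs 1500 + 3000 = 4500 multiplications versus 27000 for A(BC); the generalized algorithm in the cited paper additionally handles properties such as transposition and symmetry.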

Exact Gaussian Processes on a Million Data Points

A scalable approach for exact GPs is developed that leverages multi-GPU parallelization and methods like linear conjugate gradients, accessing the kernel matrix only through matrix multiplication, and is generally applicable, without constraints to grid data or specific kernel classes.
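"Accessing the kernel matrix only through matrix multiplication" refers to matrix-free conjugate gradients. Below is a standard CG sketch (assumed names, not the cited system's code) where the solver touches the matrix only via a `matvec` closure, so the matrix never needs to be stored explicitly:

```python
import numpy as np

def conjugate_gradients(matvec, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive-definite A, using A only
    through the closure `matvec` -- the access pattern that lets exact-GP
    solvers avoid materialising the n x n kernel matrix."""
    x = np.zeros_like(b)
    r = b - matvec(x)       # residual
    p = r.copy()            # search direction
    rs = r @ r
    for _ in range(max_iter):
        ap = matvec(p)
        alpha = rs / (p @ ap)
        x += alpha * p
        r -= alpha * ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x
```

In a GP setting, `matvec` would itself be implemented with a chunked kernel-vector product, so the whole solve runs in O(n) memory per iteration.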