Memory Safe Computations with XLA Compiler

@article{Artemev2022MemorySC,
  title={Memory Safe Computations with XLA Compiler},
  author={Artem Artemev and Tilman Roeder and Mark van der Wilk},
  journal={ArXiv},
  year={2022},
  volume={abs/2206.14148}
}
Software packages like TensorFlow and PyTorch are designed to support linear algebra operations, and their speed and usability determine their success. However, by prioritising speed, they often neglect memory requirements. As a consequence, implementations of memory-intensive algorithms that are convenient in terms of software design often cannot be run for large problems due to memory overflows. Memory-efficient solutions require complex programming approaches with significant logic outside…
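To make the memory problem concrete, the sketch below contrasts a naive kernel matrix-vector product with a hand-chunked version in JAX. It illustrates the kind of manual memory-management logic the abstract says is otherwise required. It is not the paper's method: the title indicates the authors address this at the XLA compiler level, and the function names, the squared-exponential kernel, and the `block_size` parameter here are all illustrative assumptions.

```python
import jax
import jax.numpy as jnp


def sq_dists(x, y):
    # Pairwise squared Euclidean distances via ||x||^2 + ||y||^2 - 2 x.y,
    # producing a (len(x), len(y)) matrix without a (n, m, d) intermediate.
    x2 = jnp.sum(x**2, axis=-1)[:, None]
    y2 = jnp.sum(y**2, axis=-1)[None, :]
    return jnp.maximum(x2 + y2 - 2.0 * x @ y.T, 0.0)


def kernel_matvec_naive(x, y, v, lengthscale=1.0):
    # Convenient but memory-hungry: materialises the full n-by-m kernel matrix.
    k = jnp.exp(-0.5 * sq_dists(x, y) / lengthscale**2)
    return k @ v


def kernel_matvec_chunked(x, y, v, lengthscale=1.0, block_size=256):
    # Hypothetical manual workaround: process rows of x in blocks so that
    # only a (block_size, m) slice of the kernel matrix is built per step.
    n, d = x.shape
    num_blocks = -(-n // block_size)  # ceiling division
    pad = num_blocks * block_size - n
    # Pad x with zero rows so it splits evenly; padded outputs are dropped below.
    x_padded = jnp.pad(x, ((0, pad), (0, 0)))
    x_blocks = x_padded.reshape(num_blocks, block_size, d)

    def one_block(xb):
        kb = jnp.exp(-0.5 * sq_dists(xb, y) / lengthscale**2)
        return kb @ v

    # jax.lax.map applies one_block over the leading axis, one block at a time.
    out = jax.lax.map(one_block, x_blocks)
    return out.reshape(num_blocks * block_size, *v.shape[1:])[:n]
```

Because `jax.lax.map` is implemented as a sequential loop, the chunked version should need roughly O(`block_size` · m) working memory for the kernel instead of O(n · m), at the cost of less parallelism per step. The extra padding, reshaping, and slicing is exactly the sort of bookkeeping that makes such code harder to write and maintain than the one-line naive version.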
