• Corpus ID: 239050559

A Data-Centric Optimization Framework for Machine Learning

Authors: Oliver Rausch, Tal Ben-Nun, Nikoli Dryden, Andrei Ivanov, Shigang Li, Torsten Hoefler
ABSTRACT Rapid progress in deep learning is leading to a diverse set of quickly changing models, with a dramatically growing demand for compute. However, as frameworks specialize optimization to patterns in popular networks, they implicitly constrain novel and diverse models that drive progress in research. We empower deep learning researchers by defining a flexible and user-customizable pipeline for optimizing training of arbitrary deep neural networks, based on data movement minimization. The… 


Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions
This paper contributes Tensor Comprehensions, a language close to the mathematics of deep learning that offers both imperative and declarative styles; a polyhedral just-in-time compiler that converts a mathematical description of a deep learning DAG into a CUDA kernel with delegated memory management and synchronization; and a compilation cache populated by an autotuner.
Acorns: A Framework for Accelerating Deep Neural Networks with Input Sparsity
Acorns accelerates deep neural networks with input sparsity by generating efficient sparse kernels for neural network operators from kernel templates. Each template combines directives that express the specific optimizing transformations to perform with straightforward code that describes the computation.
TVM: End-to-End Optimization Stack for Deep Learning
TVM is an end-to-end optimization stack that exposes graph-level and operator-level optimizations to provide performance portability for deep learning workloads across diverse hardware back-ends. The paper also discusses the optimization challenges specific to deep learning that TVM solves.
Accelerating Deep Learning Frameworks with Micro-Batches
cuDNN is a low-level library that provides GPU kernels frequently used in deep learning. Specifically, cuDNN implements several equivalent convolution algorithms, whose performance and memory
Astra: Exploiting Predictability to Optimize Deep Learning
Astra improves the end-to-end performance of deep learning training by up to 3x, approaches the performance of hand-optimized implementations such as cuDNN where available, and significantly outperforms static compilation frameworks such as TensorFlow XLA in both performance and robustness.
Value Function Based Performance Optimization of Deep Learning Workloads
This work models the scheduling problem as a sequence of optimization choices and presents a new technique to accurately predict the expected performance of a partial schedule. The technique finds schedules that improve the throughput of deep neural networks by 2.6x over Halide and 1.5x over TVM.
PyTorch: An Imperative Style, High-Performance Deep Learning Library
This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.
Training Deep Nets with Sublinear Memory Cost
This work designs an algorithm that needs O(√n) memory to train an n-layer network, at the computational cost of only one extra forward pass per mini-batch. It shows that computation can be traded for memory, giving a more memory-efficient training algorithm at little extra computational cost.
cuDNN: Efficient Primitives for Deep Learning
cuDNN is a library similar in intent to BLAS, with optimized routines for deep learning workloads. It contains routines for GPUs but, like BLAS, could be implemented for other platforms.
DLVM: A modern compiler infrastructure for deep learning systems
DLVM is a compiler infrastructure with a linear algebra intermediate representation, algorithmic differentiation by adjoint code generation, domain-specific optimizations, and a code generator targeting GPUs via LLVM.
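
As a rough illustration of the O(√n) memory claim in the "Training Deep Nets with Sublinear Memory Cost" entry above, the sketch below counts stored activations under a deliberately simplified cost model. The model is an assumption made here, not taken from the paper: one checkpoint is kept per segment of layers, and the activations of a single segment are recomputed and held during the backward pass. The function names are hypothetical.

```python
import math


def checkpoint_memory(n_layers: int, segment_len: int) -> int:
    """Peak count of stored activations under the simplified model:
    one checkpoint per segment, plus the activations of the one
    segment currently being recomputed for the backward pass."""
    n_segments = math.ceil(n_layers / segment_len)
    return n_segments + segment_len


def best_segment_len(n_layers: int) -> int:
    """Segment length that minimizes the peak count (brute-force search)."""
    return min(
        range(1, n_layers + 1),
        key=lambda k: checkpoint_memory(n_layers, k),
    )


if __name__ == "__main__":
    n = 100
    k = best_segment_len(n)
    # Under this model the minimum sits near k = sqrt(n),
    # giving roughly 2 * sqrt(n) stored activations.
    print(k, checkpoint_memory(n, k), 2 * math.isqrt(n))
```

Under this model, a 100-layer network needs about 20 stored activations with segments of length 10, compared with about 100 when every activation is kept. The price is recomputing each segment's forward pass once during the backward pass.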