• Corpus ID: 227228418

Value Function Based Performance Optimization of Deep Learning Workloads

  title={Value Function Based Performance Optimization of Deep Learning Workloads},
  author={Benoit Steiner and Chris Cummins and Horace He and Hugh Leather},
As machine learning techniques become ubiquitous, the efficiency of neural network implementations is becoming correspondingly paramount. Frameworks, such as Halide and TVM, separate out the algorithmic representation of the network from the schedule that determines its implementation. Finding good schedules, however, remains extremely challenging. We model this scheduling problem as a sequence of optimization choices, and present a new technique to accurately predict the expected performance… 

Figures and Tables from this paper

A data-centric optimization framework for machine learning
This work empower deep learning researchers by defining a flexible and user-customizable pipeline for optimizing training of arbitrary deep neural networks, based on data movement minimization, with competitive performance or speedups on ten different networks.
TenSet: A Large-scale Program Performance Dataset for Learned Tensor Compilers
This work introduces TenSet, a large-scale tensor program performance dataset, and provides comprehensive studies on how to learn and evaluate the cost models, including data collection, model architectures, loss functions, transfer learning, and evaluation metrics.


TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
TVM is a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends and automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations.
Learning to Optimize Tensor Programs
A learning-based framework to optimize tensor programs for deep learning workloads that learns domain-specific statistical cost models to guide the search of tensor operator implementations over billions of possible program variants and accelerates the search by effective model transfer across workloads.
Learning to optimize halide with tree search and random programs
This work presents a new algorithm to automatically schedule Halide programs for high-performance image processing and deep learning that produces schedules which are on average almost twice as fast as the existing Halide autoscheduler without autotuning, or more than two as fast with, and is the first automatic scheduling algorithm to significantly outperform human experts on average.
PyTorch: An Imperative Style, High-Performance Deep Learning Library
This paper details the principles that drove the implementation of PyTorch and how they are reflected in its architecture, and explains how the careful and pragmatic implementation of the key components of its runtime enables them to work together to achieve compelling performance.
Ansor : Generating High-Performance Tensor Programs for Deep Learning
Ansor is presented, a tensor program generation framework for deep learning applications that can find high-performance programs that are outside the search space of existing state-of-the-art approaches.
FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System
FlexTensor can optimize tensor computation programs without human interference, allowing programmers to only work on high-level programming abstraction without considering the hardware platform details.
Loop transformations leveraging hardware prefetching
This work proposes an optimization algorithm that analytically classifies an algorithmic description of a loop nest in order to decide whether it should be optimized stressing its temporal or spatial locality, while also taking hardware prefetching into account.
Decoupling algorithms from schedules for easy optimization of image processing pipelines
This work proposes a representation for feed-forward imaging pipelines that separates the algorithm from its schedule, enabling high-performance without sacrificing code clarity, and demonstrates the power of this representation by expressing a range of recent image processing applications in an embedded domain specific language called Halide and compiling them for ARM, x86, and GPUs.
Reinforcement Learning: An Introduction
This book provides a clear and simple account of the key ideas and algorithms of reinforcement learning, which ranges from the history of the field's intellectual foundations to the most recent developments and applications.
, Twan Basten , and Lou Somers . Loop transformations leveraging hardware prefetching