• Corpus ID: 57373811

Dynamic Space-Time Scheduling for GPU Inference

Paras Jain, Xiangxi Mo, Ajay Jain, Harikaran Subbaraj, Rehana Durrani, Alexey Tumanov, Joseph E. Gonzalez, Ion Stoica
Serving deep neural networks in latency-critical interactive settings often requires GPU acceleration. However, the small batch sizes typical of online inference result in poor GPU utilization, a potential performance gap which GPU resource sharing can address. In this paper, we explore several techniques to leverage both temporal and spatial multiplexing to improve GPU utilization for deep learning inference workloads. We evaluate the performance trade-offs of each approach with respect to… 

GSLICE: controlled spatial sharing of GPUs for a scalable inference platform

GSLICE virtualizes the GPU by apportioning GPU resources across different Inference Functions (IFs), thus providing isolation and guaranteeing performance. It also develops self-learning, adaptive GPU resource allocation and batching schemes that account for network traffic characteristics while keeping inference latencies below service level objectives.

Primitives Enhancing GPU Runtime Support for Improved DNN Performance

This work presents a DNN inference framework with a set of software primitives that reduce the overhead of DNN inference, increase GPU utilization, and improve performance, with lower latency and higher throughput.

SwitchFlow: preemptive multitasking for deep learning

SwitchFlow is presented, a scheduling framework for DL multitasking that achieves up to an order of magnitude lower tail latency for inference requests collocated with a training job and maintains multiple versions of each subgraph, thereby enabling low-latency preemption.

Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision

This paper surveys existing research efforts for both training and inference workloads and primarily presents how existing schedulers facilitate the respective workloads from the scheduling objectives and resource consumption features.

The OoO VLIW JIT Compiler for GPU Inference

A VLIW-inspired Out-of-Order (OoO) Just-in-Time (JIT) compiler that coalesces and reorders execution kernels at runtime for throughput-optimal device utilization while satisfying latency SLOs is proposed.

EDGE: Event-Driven GPU Execution

An event-driven GPU execution model that enables non-CPU devices to directly launch preconfigured tasks on a GPU without CPU interaction is proposed, and it is estimated that EDGE can reduce the kernel launch latency by 4.4x compared to the baseline CPU-launched approach.

GPU-NEST: Characterizing Energy Efficiency of Multi-GPU Inference Servers

This work proposes GPU-NEST, an energy efficiency characterization methodology for multi-GPU inference systems, and finds that inference scheduling in particular has great benefits in improving the energy efficiency of multi-GPU scheduling, by as much as 40 percent.

Spatial Sharing of GPU for Autotuning DNN models

DNN autotuning time is reduced by up to 75 percent and throughput is increased by a factor of 5 by using controlled spatial sharing of the GPU to multiplex several tuning applications on it.

Scrooge: A Cost-Effective Deep Learning Inference System

Scrooge, a system that provides media applications as a service, achieves these objectives by packing computations efficiently into GPU-equipped cloud VMs, using an optimization formulation to find the lowest cost VM allocations that meet the performance objectives.

Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing

This paper proposes a new inference scheduling framework for multi-model ML inference servers that auto-scales the required number of GPUs for a given workload, minimizing the cost of cloud-based inference servers.



Enabling Task Parallelism in the CUDA Scheduler

An issue queue that merges workloads that would underutilize GPU processing resources such that they can be run concurrently on an NVIDIA GPU is proposed and throughput is increased in all cases where the GPU would have been underused by a single kernel.

TVM: End-to-End Optimization Stack for Deep Learning

TVM is proposed, an end-to-end optimization stack that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. The paper also discusses the optimization challenges specific to deep learning that TVM solves.

In-datacenter performance analysis of a tensor processing unit

  • N. Jouppi, C. Young, D. Yoon
  • Computer Science
  • 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA), 2017
This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), that has been deployed in datacenters since 2015 and accelerates the inference phase of neural networks (NN). It compares the TPU to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters.

Clipper: A Low-Latency Online Prediction Serving System

Clipper is introduced, a general-purpose low-latency prediction serving system that introduces a modular architecture to simplify model deployment across frameworks and applications and improves prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks.

Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective

The hardware and software infrastructure that supports machine learning at global scale is described, leveraging both GPU and CPU platforms for training and abundant CPU capacity for real-time inference.

Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions

This work contributes a language close to the mathematics of deep learning, called Tensor Comprehensions, that offers both imperative and declarative styles; a polyhedral Just-In-Time compiler that converts a mathematical description of a deep learning DAG into a CUDA kernel with delegated memory management and synchronization; and a compilation cache populated by an autotuner.

MobileNetV2: Inverted Residuals and Linear Bottlenecks

A new mobile architecture, MobileNetV2, is described that improves the state-of-the-art performance of mobile models on multiple tasks and benchmarks, as well as across a spectrum of different model sizes, and allows decoupling of the input/output domains from the expressiveness of the transformation.

TensorFlow-Serving: Flexible, High-Performance ML Serving

TensorFlow-Serving is described, a system to serve machine learning models inside Google which is also available in the cloud and via open-source, and ways to integrate with systems that convey new models and updated versions from training to serving.

Very Deep Convolutional Networks for Large-Scale Image Recognition

This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.

Densely Connected Convolutional Networks

The Dense Convolutional Network (DenseNet) connects each layer to every other layer in a feed-forward fashion and has several compelling advantages: it alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and substantially reduces the number of parameters.