Dynamic Space-Time Scheduling for GPU Inference
@article{Jain2018DynamicSS, title={Dynamic Space-Time Scheduling for GPU Inference}, author={Paras Jain and Xiangxi Mo and Ajay Jain and Harikaran Subbaraj and Rehana Durrani and Alexey Tumanov and Joseph E. Gonzalez and Ion Stoica}, journal={ArXiv}, year={2018}, volume={abs/1901.00041} }
Serving deep neural networks in latency-critical interactive settings often requires GPU acceleration. However, the small batch sizes typical in online inference result in poor GPU utilization, a performance gap that GPU resource sharing can address. In this paper, we explore several techniques to leverage both temporal and spatial multiplexing to improve GPU utilization for deep learning inference workloads. We evaluate the performance trade-offs of each approach with respect to…
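To make the abstract's utilization argument concrete, the sketch below shows one simple form of temporal multiplexing: issuing two independent small-batch inference requests on separate CUDA streams of the same GPU so their kernels can overlap. This is a minimal PyTorch illustration under assumptions of our own (MobileNetV2 and VGG-16 as example models, batch size 1); it is not the scheduling system evaluated in the paper.

```python
# Minimal sketch of temporal multiplexing on a single GPU (illustrative only;
# not the paper's scheduler). Two small-batch requests are issued on separate
# CUDA streams so their kernels may overlap and fill otherwise idle resources.
import torch
import torchvision.models as models

device = torch.device("cuda")

# Randomly initialized example models; weights do not matter for a utilization demo.
model_a = models.mobilenet_v2(weights=None).eval().to(device)
model_b = models.vgg16(weights=None).eval().to(device)

# Batch size 1 mimics the small online-inference batches the abstract describes.
x_a = torch.randn(1, 3, 224, 224, device=device)
x_b = torch.randn(1, 3, 224, 224, device=device)

stream_a = torch.cuda.Stream()
stream_b = torch.cuda.Stream()

with torch.no_grad():
    with torch.cuda.stream(stream_a):
        y_a = model_a(x_a)   # kernels enqueued on stream_a
    with torch.cuda.stream(stream_b):
        y_b = model_b(x_b)   # kernels enqueued on stream_b, may run concurrently

torch.cuda.synchronize()     # wait for both streams before reading y_a and y_b
```

Whether the kernels actually overlap depends on their per-kernel resource demands; spatial sharing mechanisms such as NVIDIA MPS, used by several of the citing works below, additionally partition the GPU's compute resources between clients.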
42 Citations
GSLICE: controlled spatial sharing of GPUs for a scalable inference platform
- Computer Science, SoCC
- 2020
GSLICE virtualizes the GPU by apportioning its resources across different Inference Functions (IFs), providing isolation and performance guarantees, and develops self-learning, adaptive GPU resource allocation and batching schemes that account for network traffic characteristics while keeping inference latencies below service-level objectives.
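For context, GSLICE's controlled spatial sharing builds on NVIDIA MPS, which on Volta and later GPUs can cap each client process to a fraction of the GPU's SMs. The sketch below, with illustrative worker scripts and percentages of our own choosing, shows how such a cap is typically requested via the CUDA_MPS_ACTIVE_THREAD_PERCENTAGE environment variable; it is not GSLICE's actual allocation policy.

```python
# Illustrative sketch of MPS-based spatial sharing (not GSLICE's implementation).
# Requires the MPS control daemon (nvidia-cuda-mps-control) to be running.
import os
import subprocess

def launch_inference_worker(script: str, sm_share: int) -> subprocess.Popen:
    """Start an inference process capped to roughly `sm_share` percent of the GPU's SMs."""
    env = os.environ.copy()
    env["CUDA_MPS_ACTIVE_THREAD_PERCENTAGE"] = str(sm_share)  # per-client SM cap under MPS
    return subprocess.Popen(["python", script], env=env)

# Two hypothetical inference functions sharing one GPU spatially.
workers = [
    launch_inference_worker("serve_resnet.py", sm_share=60),
    launch_inference_worker("serve_mobilenet.py", sm_share=40),
]
for w in workers:
    w.wait()
```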
Primitives Enhancing GPU Runtime Support for Improved DNN Performance
- Computer Science, 2021 IEEE 14th International Conference on Cloud Computing (CLOUD)
- 2021
This work presents a DNN inference framework with a set of software primitives that reduce the overhead of DNN inference, increase GPU utilization, and improve performance, achieving lower latency and higher throughput.
SwitchFlow: preemptive multitasking for deep learning
- Computer Science, Middleware
- 2021
SwitchFlow is presented, a scheduling framework for DL multitasking that maintains multiple versions of each subgraph to enable low-latency preemption, achieving up to an order of magnitude lower tail latency for inference requests collocated with a training job.
Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision
- Computer Science, ArXiv
- 2022
This paper surveys existing research efforts for both training and inference workloads, presenting how existing schedulers support each workload in terms of scheduling objectives and resource consumption characteristics.
The OoO VLIW JIT Compiler for GPU Inference
- Computer Science, ArXiv
- 2019
A VLIW-inspired Out-of-Order (OoO) Just-in-Time (JIT) compiler that coalesces and reorders execution kernels at runtime for throughput-optimal device utilization while satisfying latency SLOs is proposed.
EDGE: Event-Driven GPU Execution
- Computer Science, 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT)
- 2019
An event-driven GPU execution model that enables non-CPU devices to directly launch preconfigured tasks on a GPU without CPU interaction is proposed, and it is estimated that EDGE can reduce the kernel launch latency by 4.4x compared to the baseline CPU-launched approach.
GPU-NEST: Characterizing Energy Efficiency of Multi-GPU Inference Servers
- Computer Science, IEEE Computer Architecture Letters
- 2020
This work proposes GPU-NEST, an energy-efficiency characterization methodology for multi-GPU inference systems, and finds that inference scheduling in particular improves the energy efficiency of multi-GPU serving by as much as 40 percent.
Spatial Sharing of GPU for Autotuning DNN models
- Computer Science, ArXiv
- 2020
DNN autotuning time is reduced by up to 75 percent and throughput is increased by a factor of 5 through controlled spatial sharing of the GPU, which multiplexes several tuning applications on the same GPU.
Scrooge: A Cost-Effective Deep Learning Inference System
- Computer Science, SoCC
- 2021
Scrooge, a system that provides media applications as a service, achieves cost-effectiveness by packing computations efficiently into GPU-equipped cloud VMs, using an optimization formulation to find the lowest-cost VM allocations that meet the performance objectives.
Serving Heterogeneous Machine Learning Models on Multi-GPU Servers with Spatio-Temporal Sharing
- Computer Science, USENIX Annual Technical Conference
- 2022
This paper proposes a new inference scheduling framework for multi-model ML inference servers that auto-scales the required number of GPUs for a given workload, minimizing the cost of cloud-based inference serving.
References
Showing 1-10 of 21 references
Enabling Task Parallelism in the CUDA Scheduler
- Computer Science
- 2009
An issue queue is proposed that merges workloads that would individually underutilize GPU processing resources so that they can run concurrently on an NVIDIA GPU, increasing throughput in all cases where the GPU would have been underused by a single kernel.
TVM: End-to-End Optimization Stack for Deep Learning
- Computer Science, ArXiv
- 2018
TVM is proposed, an end-to-end optimization stack that exposes graph-level and operator-level optimizations to provide performance portability for deep learning workloads across diverse hardware back-ends, and the optimization challenges specific to deep learning that TVM solves are discussed.
In-datacenter performance analysis of a tensor processing unit
- Computer Science, 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA)
- 2017
This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 to accelerate the inference phase of neural networks (NN), and compares it to a server-class Intel Haswell CPU and an NVIDIA K80 GPU, contemporaries deployed in the same datacenters.
Clipper: A Low-Latency Online Prediction Serving System
- Computer Science, NSDI
- 2017
Clipper is introduced, a general-purpose low-latency prediction serving system whose modular architecture simplifies model deployment across frameworks and applications and improves prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks.
Applied Machine Learning at Facebook: A Datacenter Infrastructure Perspective
- Computer Science, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)
- 2018
The hardware and software infrastructure that supports machine learning at global scale is described, leveraging both GPU and CPU platforms for training and abundant CPU capacity for real-time inference.
Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions
- Computer Science, ArXiv
- 2018
This work contributes Tensor Comprehensions, a language close to the mathematics of deep learning offering both imperative and declarative styles; a polyhedral Just-In-Time compiler that converts a mathematical description of a deep learning DAG into a CUDA kernel with delegated memory management and synchronization; and a compilation cache populated by an autotuner.
MobileNetV2: Inverted Residuals and Linear Bottlenecks
- Computer Science, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition
- 2018
A new mobile architecture, MobileNetV2, is described that improves the state-of-the-art performance of mobile models on multiple tasks and benchmarks as well as across a spectrum of different model sizes, and allows decoupling of the input/output domains from the expressiveness of the transformation.
TensorFlow-Serving: Flexible, High-Performance ML Serving
- Computer Science, ArXiv
- 2017
TensorFlow-Serving is described, a system to serve machine learning models inside Google which is also available in the cloud and via open-source, and ways to integrate with systems that convey new models and updated versions from training to serving.
Very Deep Convolutional Networks for Large-Scale Image Recognition
- Computer Science, ICLR
- 2015
This work investigates the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting using an architecture with very small convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16-19 weight layers.
Densely Connected Convolutional Networks
- Computer Science, 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2017
The Dense Convolutional Network (DenseNet) connects each layer to every other layer in a feed-forward fashion and has several compelling advantages: it alleviates the vanishing-gradient problem, strengthens feature propagation, encourages feature reuse, and substantially reduces the number of parameters.