Characterizing Concurrency Mechanisms for NVIDIA GPUs under Deep Learning Workloads

Guin Gilman and Robert J. Walls


Performance and Power Prediction for Concurrent Execution on GPUs
This paper proposes the first machine learning-based predictor of the performance and power of an ensemble of applications on a GPU. It shows that a reasonably accurate prediction can be obtained from the execution statistics of standalone workloads and the fairness of execution when these workloads are run with three representative microbenchmarks.
Aryl: An Elastic Cluster Scheduler for Deep Learning
Aryl is a new cluster scheduler that introduces the notion of server preemption cost, which it greedily reduces during server reclaiming. It relies on the JCT reduction value defined for each additional worker of an elastic job to solve the scheduling problem as a multiple-choice knapsack problem.
Characterizing Concurrency Mechanisms for NVIDIA GPUs under Deep Learning Workloads (Extended Abstract)
Hazelwood et al. observed that at Facebook data centers, variations in user activity (e.g. due to diurnal load) resulted in low utilization periods with large pools of idle resources [4]. To make use


Demystifying the Placement Policies of the NVIDIA GPU Thread Block Scheduler for Concurrent Kernels
This work empirically derives the scheduler's behavior under concurrent workloads for NVIDIA's Pascal, Volta, and Turing microarchitectures and finds that the scheduler chooses the next SM based on the SM's local resource availability.
Improving GPGPU concurrency with elastic kernels
This work studies concurrent execution of GPU kernels using multiprogram workloads on current NVIDIA Fermi GPUs, and proposes transformations that convert CUDA kernels into elastic kernels which permit fine-grained control over their resource usage.
Warped-Slicer: Efficient Intra-SM Slicing through Dynamic Resource Partitioning for GPU Multiprogramming
Warped-Slicer is proposed, a dynamic intra-SM slicing strategy that uses an analytical method for calculating the SM resource partitioning across different kernels that maximizes performance and is also computationally efficient.
Deadline-Based Scheduling for GPU with Preemption Support
This paper presents the design of a prototype real-time scheduler for GPU activities on an embedded System on a Chip featuring a cutting-edge NVIDIA GPU architecture adopted in the autonomous driving domain. The scheduler leverages latest-generation architectural features such as pixel-level preemption and thread-level preemption.
Enabling preemptive multiprogramming on GPUs
This paper argues for preemptive multitasking and designs two preemption mechanisms that can be used to implement GPU scheduling policies. It extends an NVIDIA GK110 (Kepler)-like GPU architecture to allow concurrent execution of GPU kernels from different user processes and implements a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels according to their priorities.
Dissecting the CUDA scheduling hierarchy: a Performance and Predictability Perspective
This paper corrects and consolidates previously published assumptions about the hierarchical scheduling policies of NVIDIA GPUs and their proprietary CUDA application programming interface. It also discusses how these mechanisms evolved with recently released GPU microarchitectures and how such changes influence the scheduling models that real-time system engineers can exploit.
AntMan: Dynamic Scaling on GPU Clusters for Deep Learning
AntMan, a deep learning infrastructure that co-designs cluster schedulers with deep learning frameworks and has been deployed in production at Alibaba to manage tens of thousands of daily deep learning jobs across thousands of GPUs, is presented.
GSLICE: controlled spatial sharing of GPUs for a scalable inference platform
GSLICE virtualizes the GPU by apportioning GPU resources across different Inference Functions (IFs), thus providing isolation and guaranteeing performance. It also develops self-learning and adaptive GPU resource allocation and batching schemes that account for network traffic characteristics while keeping inference latencies below service level objectives.
CuMAS: Data Transfer Aware Multi-Application Scheduling for Shared GPUs
The paper demonstrates that the data-transfer-aware nature of the CuMAS framework improves the throughput of simultaneously executed CUDA applications by up to 44% when run on an NVIDIA K40c GPU, using applications from the CUDA SDK and the Rodinia benchmark suite.
Chimera: Collaborative Preemption for Multitasking on a Shared GPU
Chimera first introduces streaming multiprocessor flushing, which can instantly preempt an SM by detecting and exploiting idempotent execution. It then uses flushing collaboratively with two previously proposed GPU preemption techniques, namely context switching and draining, to minimize throughput overhead while achieving a required preemption latency.