PREMA: A Predictive Multi-Task Scheduling Algorithm For Preemptible Neural Processing Units
@article{Choi2020PREMAAP,
  title   = {PREMA: A Predictive Multi-Task Scheduling Algorithm For Preemptible Neural Processing Units},
  author  = {Yujeong Choi and Minsoo Rhu},
  journal = {2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)},
  year    = {2020},
  pages   = {220-233}
}
To amortize cost, cloud vendors providing DNN acceleration as a service to end-users employ consolidation and virtualization to share the underlying resources among multiple DNN service requests. This paper makes a case for a "preemptible" neural processing unit (NPU) and a "predictive" multi-task scheduler to meet the latency demands of high-priority inference while maintaining high throughput. We evaluate both the mechanisms that enable NPUs to be preemptible and the policies that utilize…
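PREMA's scheduler rests on two ingredients: hardware support for checkpointing and preempting a DNN job that is already running on the NPU, and a prediction of how long each queued job will take, so that a preemption is only triggered when it actually helps a latency-critical request. The Python sketch below is a minimal illustration of that general idea; the Job fields, the pick_next function, and the preemption test are assumptions of this sketch, not the exact policy evaluated in the paper.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    priority: int            # larger = more latency-critical
    predicted_ms: float      # estimated total NPU execution time (from a latency model)
    progress_ms: float = 0.0

    @property
    def remaining_ms(self) -> float:
        return max(self.predicted_ms - self.progress_ms, 0.0)

def pick_next(ready: list[Job], running: Job | None, switch_cost_ms: float = 0.1) -> Job | None:
    """Choose which job should occupy the NPU next.

    A waiting job preempts the running one only when it has strictly higher
    priority AND letting the running job finish (plus the context-switch cost)
    would take longer than the waiting job's own predicted runtime.
    """
    if not ready:
        return running
    # Most urgent waiting job: highest priority, then shortest predicted remaining time.
    candidate = max(ready, key=lambda j: (j.priority, -j.remaining_ms))
    if running is None:
        ready.remove(candidate)
        return candidate
    if (candidate.priority > running.priority
            and running.remaining_ms + switch_cost_ms > candidate.remaining_ms):
        ready.remove(candidate)
        ready.append(running)  # the preempted job is checkpointed and re-queued
        return candidate
    return running
```

The key design point the sketch tries to convey is that the decision is predictive: without an estimate of remaining execution time, the scheduler cannot tell whether waiting for the current job to drain is cheaper than paying the preemption cost.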
49 Citations
Layerweaver: Maximizing Resource Utilization of Neural Processing Units via Layer-Wise Scheduling
- 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA)
- 2021
Layerweaver is proposed, an inference serving system with a novel multi-model time-multiplexing scheduler for NPUs that reduces the temporal waste of computation resources by interweaving layer execution of models with opposing characteristics: compute-intensive and memory-intensive.
Deadline-Aware Offloading for High-Throughput Accelerators
- 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA)
- 2021
A novel laxity-aware scheduler (LAX) is proposed that uses information collected within the GPU to dynamically vary job priority based on how much laxity jobs have before their deadline, and meets the deadlines of concurrent, latency-sensitive GPU jobs.
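For context on the laxity metric mentioned above: a job's laxity is the slack it has before its deadline, i.e., deadline minus current time minus predicted remaining execution time. The short Python sketch below illustrates a least-laxity-first ordering; the job names and numbers are hypothetical and not taken from the LAX paper.

```python
def laxity_ms(deadline_ms, now_ms, predicted_remaining_ms):
    """Slack before the deadline; negative laxity means the deadline will be missed."""
    return deadline_ms - now_ms - predicted_remaining_ms

# Rank three pending jobs least-laxity-first (hypothetical numbers, all in milliseconds).
now = 100.0
jobs = {
    "speech":  {"deadline": 110.0, "remaining": 8.0},   # laxity =  2.0
    "vision":  {"deadline": 120.0, "remaining": 12.0},  # laxity =  8.0
    "ranking": {"deadline": 105.0, "remaining": 6.0},   # laxity = -1.0 (already late)
}
order = sorted(jobs, key=lambda n: laxity_ms(jobs[n]["deadline"], now, jobs[n]["remaining"]))
# order == ["ranking", "speech", "vision"]
```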
SLO-Aware Inference Scheduler for Heterogeneous Processors in Edge Platforms
- ACM Trans. Archit. Code Optim.
- 2021
This work proposes a set of new heterogeneity-aware ML inference scheduling policies for edge platforms, based on the regularity of computation in common ML tasks, which aim to satisfy the service-level objective (SLO) requirement while reducing the energy consumption of each request.
PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable Multi-GPU Inference Servers
- ArXiv
- 2022
This paper proposes a sophisticated partitioning algorithm for reconfigurable GPUs that systematically determines a heterogeneous set of multi-granular GPU partitions best suited for the inference server's deployment, and co-designs an elastic scheduling algorithm tailored to the heterogeneously partitioned GPU server.
VELTAIR: towards high-performance multi-tenant deep learning services via adaptive compilation and scheduling
- ASPLOS
- 2022
This work systematically analyzes the opportunities and challenges of providing multi-tenant deep learning services on the general-purpose CPU architecture from the aspects of scheduling granularity and code generation, and proposes an adaptive granularity scheduling scheme to both guarantee resource usage efficiency and reduce the scheduling conflict rate.
Lazy Batching: An SLA-aware Batching System for Cloud Machine Learning Inference
- 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA)
- 2021
This paper proposes LazyBatching, an SLA-aware batching system that considers both scheduling and batching at the granularity of individual graph nodes rather than the entire graph, enabling flexible batching.
Multi-model Machine Learning Inference Serving with GPU Spatial Partitioning
- ArXiv
- 2021
A new ML inference scheduling framework for multi-model ML inference servers is proposed and it is shown that with SLO constraints, current GPUs are not fully utilized for ML inference tasks.
Exploration of Systolic-Vector Architecture with Resource Scheduling for Dynamic ML Workloads
- 2022
A scalable systolic-vector architecture that can cope with dynamically changing DNN workloads in cloud datacenters is presented, along with a heterogeneity-aware scheduling algorithm that improves throughput and energy efficiency by 81% and 20%, respectively, compared to standard round-robin scheduling.
Enable simultaneous DNN services based on deterministic operator overlap and precise latency prediction
- SC
- 2021
Abacus enables deterministic operator overlap to enforce latency predictability; it reduces QoS violations by 51.3% and improves throughput by 29.8% on average compared with state-of-the-art solutions.
DeepRecSys: A System for Optimizing End-To-End At-Scale Neural Recommendation Inference
- 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA)
- 2020
DeepRecSched is proposed, a recommendation inference scheduler that maximizes latency-bounded throughput by taking into account characteristics of inference query size and arrival patterns, model architectures, and underlying hardware systems.
References
Showing 1-10 of 97 references
Baymax: QoS Awareness and Increased Utilization for Non-Preemptive Accelerators in Warehouse Scale Computers
- ASPLOS
- 2016
Baymax is presented, a runtime system that orchestrates the execution of compute tasks from different applications and mitigates PCI-e bandwidth contention to deliver the required QoS for user-facing applications and increase the accelerator utilization.
Prophet: Precise QoS Prediction on Non-Preemptive Accelerators to Improve Utilization in Warehouse-Scale Computers
- ASPLOS 2017
- 2017
Guaranteeing Quality-of-Service (QoS) of latency-sensitive applications while improving server utilization through application co-location is important yet challenging in modern datacenters. The key…
Enabling preemptive multiprogramming on GPUs
- 2014 ACM/IEEE 41st International Symposium on Computer Architecture (ISCA)
- 2014
This paper argues for preemptive multitasking and designs two preemption mechanisms that can be used to implement GPU scheduling policies; it extends an NVIDIA GK110 (Kepler)-like GPU architecture to allow concurrent execution of GPU kernels from different user processes and implements a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels according to their priorities.
Chimera: Collaborative Preemption for Multitasking on a Shared GPU
- ASPLOS 2015
- 2015
Chimera first introduces streaming multiprocessor flushing, which can instantly preempt an SM by detecting and exploiting idempotent execution, and uses flushing collaboratively with two previously proposed GPU preemption techniques, context switching and draining, to minimize throughput overhead while meeting the required preemption latency.
NeuMMU: Architectural Support for Efficient Address Translations in Neural Processing Units
- ASPLOS
- 2020
This paper makes a case for enabling address translation in NPUs to decouple the virtual and physical memory address spaces, and proposes a memory management unit (MMU) tailored for NPUs.
Enabling Efficient Preemption for SIMT Architectures with Lightweight Context Switching
- SC16: International Conference for High Performance Computing, Networking, Storage and Analysis
- 2016
Three complementary ways to reduce and compress the architectural states to achieve lightweight context switching on SIMT processors with compiler and hardware co-design are proposed.
Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks
- 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA)
- 2018
A high-performance virtualization strategy based on a "compressing DMA engine" (cDMA) that drastically reduces the size of the data structures that are targeted for CPU-side allocations is introduced.
In-datacenter performance analysis of a tensor processing unit
- 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA)
- 2017
This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 to accelerate the inference phase of neural networks (NNs), and compares it to a server-class Intel Haswell CPU and an NVIDIA K80 GPU, which are contemporaries deployed in the same datacenters.
Improving GPGPU concurrency with elastic kernels
- ASPLOS '13
- 2013
This work studies concurrent execution of GPU kernels using multiprogram workloads on current NVIDIA Fermi GPUs, and proposes transformations that convert CUDA kernels into elastic kernels which permit fine-grained control over their resource usage.
A Case for Memory-Centric HPC System Architecture for Training Deep Neural Networks
- IEEE Computer Architecture Letters
- 2018
This work proposes a memory-centric deep learning system that can transparently expand the memory capacity accessible to the accelerators while also providing fast inter-device communication for parallel training.