PREMA: A Predictive Multi-Task Scheduling Algorithm For Preemptible Neural Processing Units

@article{Choi2020PREMAAP,
  title={PREMA: A Predictive Multi-Task Scheduling Algorithm For Preemptible Neural Processing Units},
  author={Yujeong Choi and Minsoo Rhu},
  journal={2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)},
  year={2020},
  pages={220-233}
}
  • Yujeong Choi, Minsoo Rhu
  • Published 6 September 2019
  • Computer Science
  • 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA)
To amortize cost, cloud vendors providing DNN acceleration as a service to end-users employ consolidation and virtualization to share the underlying resources among multiple DNN service requests. This paper makes a case for a "preemptible" neural processing unit (NPU) and a "predictive" multi-task scheduler to meet the latency demands of high-priority inference while maintaining high throughput. We evaluate both the mechanisms that enable NPUs to be preemptible and the policies that utilize… 
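The abstract describes two ideas working together: a latency-predictive scheduler and an NPU that can be preempted mid-job. The sketch below is a minimal illustration of that predict-then-preempt idea only; the task fields, the priority rule, and the preemption trigger are assumptions for illustration and are not PREMA's actual token-based algorithm.

```python
import heapq

class PredictiveScheduler:
    """Illustrative predict-then-preempt scheduler for a single NPU queue.

    Assumptions (not from the paper): each task carries a priority and a
    predicted latency from some latency model; a running low-priority task is
    preempted whenever a higher-priority task arrives.
    """

    def __init__(self):
        self._queue = []      # min-heap of (-priority, predicted_latency_ms, name)
        self._running = None  # (priority, predicted_latency_ms, name) or None

    def submit(self, name, priority, predicted_latency_ms):
        heapq.heappush(self._queue, (-priority, predicted_latency_ms, name))
        self._maybe_preempt()

    def _maybe_preempt(self):
        if self._running is None or not self._queue:
            return
        top_priority = -self._queue[0][0]
        if top_priority > self._running[0]:
            # Checkpoint the running task and push it back into the queue.
            prio, remaining_ms, name = self._running
            heapq.heappush(self._queue, (-prio, remaining_ms, name))
            self._running = None

    def dispatch(self):
        """Pick the next task to run on the NPU (highest priority, then shortest)."""
        if self._running is None and self._queue:
            neg_prio, latency, name = heapq.heappop(self._queue)
            self._running = (-neg_prio, latency, name)
        return self._running

# Example: a background job is running; a higher-priority request preempts it.
sched = PredictiveScheduler()
sched.submit("resnet-batch", priority=0, predicted_latency_ms=40.0)
print(sched.dispatch())   # ('resnet-batch' runs first)
sched.submit("bert-rt", priority=2, predicted_latency_ms=8.0)
print(sched.dispatch())   # preemption: 'bert-rt' takes over the NPU
```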
Layerweaver: Maximizing Resource Utilization of Neural Processing Units via Layer-Wise Scheduling
TLDR
Layerweaver, an inference serving system with a novel multi-model time-multiplexing scheduler for NPUs, is proposed; it reduces the temporal waste of computation resources by interweaving the layer execution of multiple models with opposing characteristics, i.e., compute-intensive and memory-intensive.
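As a rough illustration of the interweaving idea in this summary, the sketch below alternates layers from a compute-bound model and a memory-bound model so that neither resource sits idle for long. The labels and the simple alternation rule are assumptions for illustration, not Layerweaver's actual scheduler.

```python
from collections import deque

def interleave_layers(compute_bound_layers, memory_bound_layers):
    """Return an execution order that alternates between the two layer queues."""
    a, b = deque(compute_bound_layers), deque(memory_bound_layers)
    order = []
    while a or b:
        if a:
            order.append(("compute", a.popleft()))
        if b:
            order.append(("memory", b.popleft()))
    return order

# Example: two models with opposing characteristics share the NPU layer by layer.
print(interleave_layers(["conv1", "conv2", "conv3"], ["emb", "fc1", "fc2"]))
```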
Deadline-Aware Offloading for High-Throughput Accelerators
TLDR
A novel laxity-aware scheduler (LAX) is proposed that uses information collected within the GPU to dynamically vary job priority based on how much laxity each job has before its deadline, meeting the deadlines of concurrent, latency-sensitive GPU jobs.
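A minimal sketch of laxity-based prioritization, assuming laxity is defined as time-to-deadline minus estimated remaining execution time. The field names and the least-laxity-first rule are illustrative assumptions, not LAX's exact in-GPU mechanism.

```python
import time

def laxity(deadline_s, remaining_exec_s, now_s=None):
    """Slack a job has left: time until its deadline minus its remaining work."""
    now_s = time.monotonic() if now_s is None else now_s
    return (deadline_s - now_s) - remaining_exec_s

def pick_next(jobs, now_s=None):
    """jobs: iterable of (name, deadline_s, remaining_exec_s); least laxity runs first."""
    return min(jobs, key=lambda j: laxity(j[1], j[2], now_s))

# Example: job "b" has the least slack (5 - 4 = 1s), so it is scheduled first.
jobs = [("a", 10.0, 2.0), ("b", 5.0, 4.0)]
print(pick_next(jobs, now_s=0.0))
```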
SLO-Aware Inference Scheduler for Heterogeneous Processors in Edge Platforms
TLDR
A set of new heterogeneity-aware ML inference scheduling policies for edge platforms is proposed, based on the regularity of computation in common ML tasks; the policies aim to satisfy service-level objective (SLO) requirements while reducing the energy consumption of each request.
PARIS and ELSA: An Elastic Scheduling Algorithm for Reconfigurable Multi-GPU Inference Servers
TLDR
This paper proposes a sophisticated partitioning algorithm for reconfigurable GPUs that systematically determines a heterogeneous set of multi-granular GPU partitions best suited for the inference server's deployment, and co-designs an elastic scheduling algorithm tailored to the heterogeneously partitioned GPU server.
VELTAIR: towards high-performance multi-tenant deep learning services via adaptive compilation and scheduling
TLDR
This work systematically analyzes the opportunities and challenges of providing multi-tenant deep learning services on the general-purpose CPU architecture from the aspects of scheduling granularity and code generation, and proposes an adaptive granularity scheduling scheme that both guarantees resource-usage efficiency and reduces the scheduling conflict rate.
Lazy Batching: An SLA-aware Batching System for Cloud Machine Learning Inference
TLDR
This paper proposes LazyBatching, an SLA-aware batching system that considers both scheduling and batching at the granularity of individual graph nodes, rather than the entire graph, enabling flexible batching.
Multi-model Machine Learning Inference Serving with GPU Spatial Partitioning
TLDR
A new ML inference scheduling framework for multi-model ML inference servers is proposed, and it is shown that, under SLO constraints, current GPUs are not fully utilized for ML inference tasks.
Exploration of Systolic-Vector Architecture with Resource Scheduling for Dynamic ML Workloads
TLDR
A scalable systolic-vector architecture that can cope with dynamically changing DNN workloads in cloud datacenters is presented, and a heterogeneity-aware scheduling algorithm is proposed that improves throughput and energy efficiency by 81% and 20%, respectively, compared to standard round-robin scheduling.
Enable simultaneous DNN services based on deterministic operator overlap and precise latency prediction
TLDR
Abacus enables deterministic operator overlap to enforce latency predictability; it reduces QoS violations by 51.3% and improves throughput by 29.8% on average compared with state-of-the-art solutions.
DeepRecSys: A System for Optimizing End-To-End At-Scale Neural Recommendation Inference
TLDR
DeepRecSched is proposed, a recommendation inference scheduler that maximizes latency-bounded throughput by taking into account characteristics of inference query size and arrival patterns, model architectures, and underlying hardware systems.
…

References

Showing 1-10 of 97 references
Baymax: QoS Awareness and Increased Utilization for Non-Preemptive Accelerators in Warehouse Scale Computers
TLDR
Baymax is presented, a runtime system that orchestrates the execution of compute tasks from different applications and mitigates PCI-e bandwidth contention to deliver the required QoS for user-facing applications and increase the accelerator utilization.
Prophet: Precise QoS Prediction on Non-Preemptive Accelerators to Improve Utilization in Warehouse-Scale Computers
Guaranteeing Quality-of-Service (QoS) of latency-sensitive applications while improving server utilization through application co-location is important yet challenging in modern datacenters. …
Enabling preemptive multiprogramming on GPUs
TLDR
This paper argues for preemptive multitasking and designs two preemption mechanisms that can be used to implement GPU scheduling policies; it extends an NVIDIA GK110 (Kepler)-like GPU architecture to allow concurrent execution of GPU kernels from different user processes and implements a scheduling policy that dynamically distributes the GPU cores among concurrently running kernels according to their priorities.
Chimera: Collaborative Preemption for Multitasking on a Shared GPU
TLDR
Chimera first introduces streaming multiprocessor (SM) flushing, which can instantly preempt an SM by detecting and exploiting idempotent execution, and uses flushing collaboratively with two previously proposed GPU preemption techniques, context switching and draining, to minimize throughput overhead while achieving the required preemption latency.
NeuMMU: Architectural Support for Efficient Address Translations in Neural Processing Units
TLDR
A case is made for enabling address translation in NPUs to decouple the virtual and physical memory address spaces, and a memory management unit (MMU) tailored for NPUs is proposed.
Enabling Efficient Preemption for SIMT Architectures with Lightweight Context Switching
  • Zhen Lin, L. Nyland, Huiyang Zhou
  • Computer Science
    SC16: International Conference for High Performance Computing, Networking, Storage and Analysis
  • 2016
TLDR
Three complementary ways to reduce and compress architectural state are proposed to achieve lightweight context switching on SIMT processors through compiler and hardware co-design.
Compressing DMA Engine: Leveraging Activation Sparsity for Training Deep Neural Networks
TLDR
A high-performance virtualization strategy based on a "compressing DMA engine" (cDMA), which drastically reduces the size of the data structures targeted for CPU-side allocations, is introduced.
In-datacenter performance analysis of a tensor processing unit
  • N. Jouppi, C. Young, D. Yoon
  • Computer Science
    2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA)
  • 2017
TLDR
This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN), and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters.
Improving GPGPU concurrency with elastic kernels
TLDR
This work studies concurrent execution of GPU kernels using multiprogrammed workloads on current NVIDIA Fermi GPUs and proposes transformations that convert CUDA kernels into elastic kernels, which permit fine-grained control over their resource usage.
A Case for Memory-Centric HPC System Architecture for Training Deep Neural Networks
TLDR
This work proposes a memory-centric deep learning system that can transparently expand the memory capacity accessible to the accelerators while also providing fast inter-device communication for parallel training.
…