Orloj: Predictably Serving Unpredictable DNNs
@article{Yu2022OrlojPS,
  title   = {Orloj: Predictably Serving Unpredictable DNNs},
  author  = {Peifeng Yu and Yuqing Qiu and Xin Jin and Mosharaf Chowdhury},
  journal = {ArXiv},
  year    = {2022},
  volume  = {abs/2209.00159}
}
Existing DNN serving solutions can provide tight latency SLOs while maintaining high throughput via careful scheduling of incoming requests, whose execution times are assumed to be highly predictable and data-independent. However, inference requests to emerging dynamic DNNs, e.g., popular natural language processing (NLP) models and computer vision (CV) models that skip layers, are data-dependent. They exhibit poor performance when served using existing solutions because they experience…
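To make the mismatch concrete, here is a toy simulation (not from the paper; the latency numbers and the early-exit model are invented for illustration) of how a scheduler that admits requests using a single profiled latency misses SLOs once execution time becomes data-dependent:

```python
import random

SLO_MS = 50
PROFILED_LATENCY_MS = 20           # static DNN: one profiled number suffices

def dynamic_dnn_latency_ms():
    """Early-exit model: latency depends on the input, not just the model."""
    exit_layer = random.choice([4, 8, 12])   # input-dependent exit point
    return 5 * exit_layer                    # 20, 40, or 60 ms

random.seed(0)
misses = 0
for _ in range(1000):
    queue_delay_ms = 25
    # The scheduler admits the request because 25 + 20 <= 50 ...
    assert queue_delay_ms + PROFILED_LATENCY_MS <= SLO_MS
    # ... but the actual, data-dependent latency may blow the SLO.
    if queue_delay_ms + dynamic_dnn_latency_ms() > SLO_MS:
        misses += 1
print(f"SLO misses despite 'safe' admission: {misses}/1000")
```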
55 References
Serving DNNs like Clockwork: Performance Predictability from the Bottom Up
OSDI, 2020
This work adopts a principled design methodology to successively build a fully distributed model serving system that achieves predictable end-to-end performance, and demonstrates that Clockwork exploits predictable execution times to achieve tight request-level service-level objectives (SLOs) as well as a high degree of request-level performance isolation.
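A minimal sketch of the idea (my own simplification, not Clockwork's code; the model names and execution times are hypothetical): when per-request execution time is known, the scheduler can decide upfront whether a request can still meet its SLO and reject it otherwise.

```python
EXEC_TIME_MS = {"resnet50": 12.0, "bert_base": 40.0}   # hypothetical profiled times

def admit(model, deadline_ms, now_ms, gpu_free_at_ms):
    """Admit only if the request provably finishes before its deadline."""
    start_ms = max(now_ms, gpu_free_at_ms)
    return start_ms + EXEC_TIME_MS[model] <= deadline_ms

# A request with a 50 ms deadline, arriving while the GPU is busy until t=45,
# is rejected upfront instead of missing its SLO later.
print(admit("resnet50", deadline_ms=50, now_ms=0, gpu_free_at_ms=45))  # False
```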
Tiresias: A GPU Cluster Manager for Distributed Deep Learning
NSDI, 2019
This work presents Tiresias, a GPU cluster manager tailored for distributed DL training jobs, which schedules and places DL jobs to reduce their job completion times (JCT), and proposes two scheduling algorithms that aim to minimize the average JCT.
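A minimal sketch of the underlying intuition, assuming a least-attained-service rule with attained service measured as GPU count times elapsed time (Tiresias's actual 2DAS and Gittins-index policies are more sophisticated):

```python
def attained_service(job):
    """GPU time consumed so far: number of GPUs times seconds run."""
    return job["num_gpus"] * job["seconds_run"]

def pick_next(jobs):
    """Least-attained-service first: favors short jobs without knowing runtimes."""
    return min(jobs, key=attained_service)

jobs = [
    {"name": "A", "num_gpus": 4, "seconds_run": 300},   # 1200 GPU-seconds
    {"name": "B", "num_gpus": 1, "seconds_run": 60},    # 60 GPU-seconds
]
print(pick_next(jobs)["name"])   # "B"
```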
BARISTA: Efficient and Scalable Serverless Serving System for Deep Learning Prediction Services
IC2E, 2019
This work presents a distributed and scalable deep-learning prediction serving system called Barista, and proposes an intelligent agent to allocate and manage the compute resources by horizontal and vertical scaling to maintain the required prediction latency.
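A toy horizontal-scaling rule in the same spirit (the numbers and the headroom factor are invented, not BARISTA's algorithm):

```python
import math

def replicas_needed(arrival_rps, per_request_latency_ms, headroom=0.7):
    """Horizontal scaling: keep each replica's utilization under `headroom`
    so queueing delay stays small and the prediction latency target holds."""
    capacity_rps = 1000.0 / per_request_latency_ms   # one request at a time
    return math.ceil(arrival_rps / (capacity_rps * headroom))

print(replicas_needed(arrival_rps=120, per_request_latency_ms=25))   # 5
```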
InferLine: latency-aware provisioning and scaling for prediction serving pipelines
SoCC, 2020
This paper introduces InferLine, a system which provisions and manages the individual stages of prediction pipelines to meet end-to-end tail latency constraints while minimizing cost, and generalizes across state-of-the-art model serving frameworks.
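A greedy toy planner illustrating per-stage provisioning against an end-to-end budget (my own sketch; it ignores the cost optimization and queueing effects that InferLine handles):

```python
import math

def provision(stages, arrival_rps, slo_ms):
    """stages: list of (name, per_query_latency_ms). Returns replicas per stage
    so each stage sustains the arrival rate; raises if the latency budget
    cannot be met even with zero queueing."""
    total_latency = sum(lat for _, lat in stages)
    if total_latency > slo_ms:
        raise ValueError("stage latencies alone exceed the end-to-end SLO")
    return {name: math.ceil(arrival_rps / (1000.0 / lat)) for name, lat in stages}

pipeline = [("preprocess", 5.0), ("detector", 30.0), ("classifier", 10.0)]
print(provision(pipeline, arrival_rps=200, slo_ms=100))
# {'preprocess': 1, 'detector': 6, 'classifier': 2}
```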
MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving
USENIX ATC, 2019
This paper tackles the dual challenge of SLO compliance and cost effectiveness with MArk (Model Ark), a general-purpose inference serving system built in Amazon Web Services (AWS), and evaluated the performance of MArk using several state-of-the-art ML models trained in popular frameworks including TensorFlow, MXNet, and Keras.
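As a sketch of the cost-vs-SLO trade-off (the instance profiles below are invented; MArk derives real ones by measuring AWS offerings, including serverless and spot options):

```python
# Hypothetical per-instance profiles; MArk measures these on real hardware.
INSTANCES = [
    {"type": "c5.large",   "latency_ms": 80, "cost_per_hr": 0.085},
    {"type": "c5.2xlarge", "latency_ms": 30, "cost_per_hr": 0.340},
    {"type": "p3.2xlarge", "latency_ms": 8,  "cost_per_hr": 3.060},
]

def cheapest_meeting_slo(slo_ms):
    """Pick the lowest-cost instance type whose latency satisfies the SLO."""
    feasible = [i for i in INSTANCES if i["latency_ms"] <= slo_ms]
    return min(feasible, key=lambda i: i["cost_per_hr"]) if feasible else None

print(cheapest_meeting_slo(50)["type"])   # c5.2xlarge
```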
INFaaS: A Model-less and Managed Inference Serving System
2019
INFaaS is introduced, a managed and model-less system for distributed inference serving, where developers simply specify the performance and accuracy requirements for their applications without needing to specify a specific model-variant for each query.
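A toy selector conveying the model-less interface (variant names and numbers are hypothetical): developers state requirements, and the system picks the variant.

```python
VARIANTS = [  # hypothetical variants of one model family
    {"name": "resnet18-cpu",  "latency_ms": 60, "accuracy": 0.70, "cost": 1},
    {"name": "resnet50-gpu",  "latency_ms": 10, "accuracy": 0.76, "cost": 6},
    {"name": "resnet50-int8", "latency_ms": 25, "accuracy": 0.75, "cost": 2},
]

def select_variant(max_latency_ms, min_accuracy):
    """Cheapest variant meeting the stated latency and accuracy requirements."""
    feasible = [v for v in VARIANTS
                if v["latency_ms"] <= max_latency_ms and v["accuracy"] >= min_accuracy]
    return min(feasible, key=lambda v: v["cost"]) if feasible else None

print(select_variant(max_latency_ms=30, min_accuracy=0.74)["name"])  # resnet50-int8
```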
Clipper: A Low-Latency Online Prediction Serving System
NSDI, 2017
Clipper is introduced, a general-purpose low-latency prediction serving system that introduces a modular architecture to simplify model deployment across frameworks and applications and improves prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks.
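Part of Clipper's throughput gain comes from adaptive batching; below is a minimal AIMD-style batch-size controller in that spirit (a simplification, not Clipper's code):

```python
def adjust_batch_size(batch_size, batch_latency_ms, slo_ms):
    """Additive-increase / multiplicative-decrease on the maximum batch size."""
    if batch_latency_ms <= slo_ms:
        return batch_size + 1          # latency budget holds: probe upward
    return max(1, batch_size // 2)     # SLO violated: back off quickly

bs = 1
for latency in [10, 12, 15, 22, 31, 55, 18]:   # observed ms, SLO = 40 ms
    bs = adjust_batch_size(bs, latency, slo_ms=40)
print(bs)   # 4
```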
Nexus: a GPU cluster engine for accelerating DNN-based video analysis
SOSP, 2019
Nexus is a fully implemented system whose cluster-scale resource manager performs detailed scheduling of GPUs, reasons about groups of DNN invocations that need to be co-scheduled, and moves from the conventional whole-DNN execution model to executing fragments of DNNs.
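A toy first-fit packing by GPU duty cycle, loosely inspired by Nexus's "squishy" bin packing (heavily simplified; batching and fragment execution are omitted):

```python
def pack_gpus(sessions):
    """First-fit packing: each session occupies exec_ms / period_ms of a GPU,
    and a GPU holds sessions whose duty cycles sum to at most 1.0."""
    gpus = []   # each GPU is a list of (name, load) pairs
    for name, exec_ms, period_ms in sessions:
        load = exec_ms / period_ms
        for gpu in gpus:
            if sum(l for _, l in gpu) + load <= 1.0:
                gpu.append((name, load))
                break
        else:
            gpus.append([(name, load)])
    return gpus

print(pack_gpus([("det", 40, 100), ("cls", 25, 100), ("ocr", 50, 100)]))
# [[('det', 0.4), ('cls', 0.25)], [('ocr', 0.5)]]
```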
3Sigma: distribution-based cluster scheduling for runtime uncertainty
EuroSys, 2018
Analysis of job traces from three different large-scale cluster environments shows that, while the runtimes of many jobs can be predicted well, even state-of-the-art predictors have wide error profiles, and the performance of 3Sigma approaches the end-to-end performance of a scheduler based on a hypothetical, perfect runtime predictor.
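The core idea as a toy sketch (mine, not 3Sigma's implementation): score a scheduling decision against the full predicted runtime distribution instead of a single point estimate.

```python
def expected_utility(runtime_hist, deadline_s, value=1.0, penalty=-2.0):
    """runtime_hist: list of (runtime_s, probability) from past runs.
    Expected payoff of running now against a deadline, computed over the
    whole distribution rather than one predicted runtime."""
    return sum(p * (value if r <= deadline_s else penalty)
               for r, p in runtime_hist)

hist = [(100, 0.6), (300, 0.3), (900, 0.1)]    # wide error profile
print(expected_utility(hist, deadline_s=400))  # 0.6 + 0.3 - 0.2 ≈ 0.7
```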
Serverless in the Wild: Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider
USENIX ATC, 2020
This paper first characterizes the entire production FaaS workload of Azure Functions, then proposes a practical resource management policy that significantly reduces the number of function cold starts while spending fewer resources than state-of-the-practice policies.
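A sketch of the histogram-driven idea (simplified from the paper's hybrid policy; the percentile choices here are illustrative):

```python
def percentile(xs, pct):
    """Nearest-rank percentile over a small sample."""
    xs = sorted(xs)
    return xs[min(len(xs) - 1, round(pct / 100 * (len(xs) - 1)))]

def windows(idle_minutes, lo=5, hi=99):
    """After an invocation, keep the function unloaded for `prewarm` minutes
    (few next invocations arrive sooner), then load it and keep it warm
    through `keepalive` minutes of idleness (most arrive by then)."""
    return percentile(idle_minutes, lo), percentile(idle_minutes, hi)

idle = [2, 3, 3, 4, 5, 5, 6, 30, 60, 120]   # minutes between invocations
print(windows(idle))   # (2, 120)
```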