Orloj: Predictably Serving Unpredictable DNNs

by Peifeng Yu, Yuqing Qiu, Xin Jin, and Mosharaf Chowdhury
Existing DNN serving solutions can provide tight latency SLOs while maintaining high throughput via careful scheduling of incoming requests, whose execution times are assumed to be highly predictable and data-independent. However, inference requests to emerging dynamic DNNs – e.g., popular natural language processing (NLP) models and computer vision (CV) models that skip layers – are data-dependent. They exhibit poor performance when served using existing solutions because they experience…
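The data-dependence the abstract describes can be seen in a toy early-exit model: the number of layers actually executed, and hence the latency, varies per input. This is an illustrative sketch with invented layer logic, not code from Orloj or any real CV model:

```python
def layer(x):
    """Toy layer: transforms x and reports a confidence score."""
    x = x * 2
    return x, min(1.0, x / 10)

def early_exit_forward(x, depth=4, threshold=0.9):
    """Early-exit inference: stop once an intermediate classifier is
    confident, so the number of layers executed depends on the input."""
    layers_run = 0
    for _ in range(depth):
        x, confidence = layer(x)
        layers_run += 1
        if confidence >= threshold:
            break
    return x, layers_run

print(early_exit_forward(8))   # (16, 1): an "easy" input exits after 1 layer
print(early_exit_forward(1))   # (16, 4): a "hard" input runs all 4 layers
```

A scheduler that assumes one fixed execution time per model will mispredict both cases, which is the failure mode Orloj targets.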

Serving DNNs like Clockwork: Performance Predictability from the Bottom Up

This work adopts a principled design methodology to successively build a fully distributed model serving system that achieves predictable end-to-end performance, and demonstrates that Clockwork exploits predictable execution times to achieve tight request-level service-level objectives (SLOs) as well as a high degree of request-level performance isolation.

Tiresias: A GPU Cluster Manager for Distributed Deep Learning

This work presents Tiresias, a GPU cluster manager tailored for distributed DL training jobs, which schedules and places DL jobs to reduce their job completion times (JCT), and proposes two scheduling algorithms that aim to minimize the average JCT.

BARISTA: Efficient and Scalable Serverless Serving System for Deep Learning Prediction Services

This work presents a distributed and scalable deep-learning prediction serving system called Barista, and proposes an intelligent agent to allocate and manage the compute resources by horizontal and vertical scaling to maintain the required prediction latency.

InferLine: latency-aware provisioning and scaling for prediction serving pipelines

This paper introduces InferLine, a system which provisions and manages the individual stages of prediction pipelines to meet end-to-end tail latency constraints while minimizing cost and generalizes across state-of-the-art model serving frameworks.

MArk: Exploiting Cloud Services for Cost-Effective, SLO-Aware Machine Learning Inference Serving

This paper tackles the dual challenge of SLO compliance and cost effectiveness with MArk (Model Ark), a general-purpose inference serving system built on Amazon Web Services (AWS), and evaluates the performance of MArk using several state-of-the-art ML models trained in popular frameworks, including TensorFlow, MXNet, and Keras.

INFaaS: A Model-less and Managed Inference Serving System

INFaaS is introduced, a managed and model-less system for distributed inference serving, in which developers simply specify the performance and accuracy requirements for their applications without having to choose a specific model-variant for each query.

Clipper: A Low-Latency Online Prediction Serving System

Clipper is introduced, a general-purpose low-latency prediction serving system with a modular architecture that simplifies model deployment across frameworks and applications and improves prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks.

Nexus: a GPU cluster engine for accelerating DNN-based video analysis

Nexus is a fully implemented system that includes cluster-scale resource management: it performs detailed scheduling of GPUs, reasons about groups of DNN invocations that need to be co-scheduled, and moves from the conventional whole-DNN execution model to executing fragments of DNNs.

3Sigma: distribution-based cluster scheduling for runtime uncertainty

Analysis of job traces from three large-scale cluster environments shows that, while the runtimes of many jobs can be predicted well, even state-of-the-art predictors have wide error profiles; nonetheless, the end-to-end performance of 3Sigma approaches that of a scheduler based on a hypothetical, perfect runtime predictor.
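The core idea named in the title – scheduling on full runtime distributions rather than point estimates – can be sketched minimally. The function name and the deadline-overrun metric below are illustrative assumptions, not 3Sigma's actual scoring function:

```python
import statistics

def expected_cost_beyond_deadline(runtime_samples, deadline):
    """Expected overrun past a deadline, computed over the full
    empirical runtime distribution instead of a point estimate."""
    overruns = [max(0.0, r - deadline) for r in runtime_samples]
    return sum(overruns) / len(overruns)

# Two jobs with identical mean runtime but different spread:
narrow = [95, 100, 105, 100, 100]
wide = [10, 10, 10, 10, 460]  # same mean (100), heavy tail
assert statistics.mean(narrow) == statistics.mean(wide) == 100

# A point-estimate scheduler treats these jobs identically;
# a distribution-based one distinguishes them:
print(expected_cost_beyond_deadline(narrow, 120))  # 0.0
print(expected_cost_beyond_deadline(wide, 120))    # 68.0
```

This is why wide predictor error profiles hurt point-estimate schedulers: two jobs with the same predicted runtime can carry very different tail risk.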

Serverless in the Wild: Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider

This paper first characterizes the entire production FaaS workload of Azure Functions, and then proposes a practical resource management policy that significantly reduces the number of function cold starts while spending fewer resources than state-of-the-practice policies.
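One way such a policy can trade resources against cold starts is to track each function's idle-time distribution in a coarse histogram and keep an instance warm only for the typical gap between invocations. The sketch below is loosely inspired by that idea; the class, its parameters, and the percentile cutoff are hypothetical, not the paper's actual policy:

```python
from collections import Counter

class KeepAlivePolicy:
    """Track per-function idle times (gaps between invocations) in a
    coarse histogram and keep instances warm just past the common gap."""

    def __init__(self, bucket_minutes=1, percentile=0.85):
        self.hist = Counter()  # idle-time bucket -> observation count
        self.total = 0
        self.bucket = bucket_minutes
        self.percentile = percentile

    def observe_idle(self, idle_minutes):
        self.hist[int(idle_minutes // self.bucket)] += 1
        self.total += 1

    def keep_alive_minutes(self):
        """Smallest keep-alive window covering `percentile` of observed
        idle times; invocations inside the window avoid a cold start."""
        need = self.percentile * self.total
        seen = 0
        for b in sorted(self.hist):
            seen += self.hist[b]
            if seen >= need:
                return (b + 1) * self.bucket
        return 0

policy = KeepAlivePolicy()
for gap in [2, 3, 2, 4, 3, 2, 60]:  # mostly short gaps, one outlier
    policy.observe_idle(gap)
print(policy.keep_alive_minutes())  # 5
```

Because the window is derived from the distribution's body rather than its maximum, the single 60-minute outlier does not force the function to stay warm for an hour.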