Corpus ID: 143422082

Parity Models: A General Framework for Coding-Based Resilience in ML Inference

@article{Kosaian2019ParityMA,
  title={Parity Models: A General Framework for Coding-Based Resilience in ML Inference},
  author={Jack Kosaian and K. V. Rashmi and Shivaram Venkataraman},
  journal={ArXiv},
  year={2019},
  volume={abs/1905.00863}
}
Machine learning models are becoming the primary workhorses for many applications. Production services deploy models through prediction serving systems that take in queries and return predictions by performing inference on machine learning models. In order to scale to high query rates, prediction serving systems are run on many machines in cluster settings, and thus are prone to slowdowns and failures that inflate tail latency and cause violations of strict latency targets. Current approaches… 
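To make the coding-based idea in the title concrete, here is a minimal sketch of how erasure-coded redundancy can mask a single slow or failed prediction, assuming a simple addition/subtraction code over k queries and a hypothetical parity_model trained to approximate the sum of the deployed model's outputs; this is an illustration of the general idea rather than the paper's exact construction.

import numpy as np

k = 4                                   # number of queries protected by one parity query
rng = np.random.default_rng(0)

def base_model(x):
    # Stand-in for the deployed inference model (assumption for this sketch).
    return np.tanh(x)

def parity_model(p):
    # Stand-in for a model trained so that parity_model(sum of queries)
    # approximates the sum of base_model outputs (assumption).
    return np.tanh(p)

queries = [rng.normal(size=8) for _ in range(k)]
parity_query = np.sum(queries, axis=0)  # encode: one extra "parity" query

# k + 1 servers run in parallel; suppose server 2 is slow or has failed.
outputs = {i: base_model(q) for i, q in enumerate(queries) if i != 2}
parity_output = parity_model(parity_query)

# Decode: approximate the missing prediction by subtracting the available ones.
reconstructed = parity_output - sum(outputs.values())
print("approximate prediction for the straggling query:", reconstructed[:3])

Because base_model is non-linear, the subtraction only yields an approximation, which is why a separate model must be trained for the parity query rather than reusing the original model.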
Citations

Rateless Codes for Distributed Non-linear Computations
TLDR
This work proposes a coded computing strategy for mitigating the effect of stragglers on non-linear distributed computations and shows that erasure codes can be used to generate and compute random linear combinations of functions at the nodes such that the original function can be computed as long as a subset of nodes return their computations.
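A hedged sketch of the random-linear-combination idea in this summary: each of n nodes returns one random combination of k function outputs, and any k returned combinations let the master solve for the originals. The function choices and names below are mine, not the paper's.

import numpy as np

rng = np.random.default_rng(1)
k, n = 3, 5
funcs = [np.sin, np.cos, np.tanh]          # the k non-linear functions (example choice)
x = 0.7

G = rng.normal(size=(n, k))                # random encoding coefficients, one row per node
true_vals = np.array([f(x) for f in funcs])  # computed centrally here only to simulate the nodes
coded = G @ true_vals                      # what each node would return for input x

returned = [0, 2, 4]                       # any k nodes that respond in time
recovered = np.linalg.solve(G[returned], coded[returned])
print(np.allclose(recovered, true_vals))   # True: originals recovered from a subset of nodes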
Rateless Codes for Near-Perfect Load Balancing in Distributed Matrix-Vector Multiplication
TLDR
This paper proposes a rateless fountain coding strategy that achieves the best of both worlds -- it is proved that its latency is asymptotically equal to that of ideal load balancing, and it performs asymptotically zero redundant computations.
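A toy illustration of the rateless idea for matrix-vector multiplication: workers stream back random linear combinations of rows of A applied to x, and the master stops as soon as it has enough independent equations. The paper uses fountain (LT) codes with a peeling decoder; the dense random combinations and rank check below are a simplification.

import numpy as np

rng = np.random.default_rng(2)
m, d = 6, 4
A = rng.normal(size=(m, d))
x = rng.normal(size=d)

rows, vals = [], []
while True:
    g = rng.integers(0, 2, size=m).astype(float)   # random combination of rows of A
    vals.append((g @ A) @ x)                       # one coded result from a worker
    rows.append(g)
    if np.linalg.matrix_rank(np.array(rows)) == m: # enough equations have arrived
        break

G = np.array(rows)
Ax = np.linalg.lstsq(G, np.array(vals), rcond=None)[0]
print(np.allclose(Ax, A @ x))                      # True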
Enabling Low-Redundancy Proactive Fault Tolerance for Stream Machine Learning via Erasure Coding
TLDR
This work designs StreamLEC, a stream machine learning system that leverages erasure coding to provide low-redundancy proactive fault tolerance for immediate failure recovery; it achieves much higher throughput than both reactive fault tolerance and replication-based proactive fault tolerance, with negligible failure-recovery overhead.
Slack squeeze coded computing for adaptive straggler mitigation
TLDR
This paper proposes a dynamic workload distribution strategy for coded computation called Slack Squeeze Coded Computation (S2C2), which squeezes the compute slack (i.e., overhead) that is built into the coded computing frameworks by efficiently assigning work for all fast and slow nodes according to their speeds and without needing to re-distribute data.
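A purely illustrative sketch of the slack-squeezing intuition in this summary: instead of giving every node an equal coded share, assign work in proportion to observed node speeds so slow nodes receive less and the built-in slack is not wasted. This is not the S2C2 algorithm itself, just the load-assignment idea.

import numpy as np

total_rows = 1200
speeds = np.array([1.0, 1.0, 0.9, 0.4])            # measured relative node speeds
shares = np.floor(total_rows * speeds / speeds.sum()).astype(int)
shares[0] += total_rows - shares.sum()             # hand the rounding remainder to a fast node
print(dict(enumerate(shares)))                     # rows of work assigned per node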
Collage Inference: Using Coded Redundancy for Lowering Latency Variation in Distributed Image Classification Systems
TLDR
This work proposes the collage inference technique, which uses a novel convolutional neural network model, collage-cnn, to provide low-cost redundancy and demonstrates that the 99th percentile tail latency of inference can be reduced by 1.2x to 2x compared to replication-based approaches while providing high accuracy.
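A hedged sketch of the collage idea: the k single-image requests are also tiled into one collage image for a single backup model, so a straggling single-image prediction can be replaced by the backup's prediction for the corresponding cell. Both models below are placeholders, not the collage-cnn architecture.

import numpy as np

def single_model(img):
    # Placeholder single-image classifier (assumption for this sketch).
    return int(img.mean() > 0.5)

def collage_cnn(collage):
    # Placeholder backup model returning one prediction per 2x2 cell (assumption).
    h, w = collage.shape[0] // 2, collage.shape[1] // 2
    cells = [collage[:h, :w], collage[:h, w:], collage[h:, :w], collage[h:, w:]]
    return [single_model(c) for c in cells]

rng = np.random.default_rng(3)
imgs = [rng.random((32, 32)) for _ in range(4)]
collage = np.block([[imgs[0], imgs[1]], [imgs[2], imgs[3]]])   # tile 4 requests into one image

preds = {i: single_model(img) for i, img in enumerate(imgs) if i != 1}  # request 1 straggles
preds[1] = collage_cnn(collage)[1]     # fall back to the collage model's prediction for cell 1
print(preds)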
A demonstration of willump
TLDR
This demo presents Willump, an optimizer for ML inference that introduces statistically-motivated optimizations targeting ML applications whose performance bottleneck is feature computation.
Collage Inference: Using Coded Redundancy for Low Variance Distributed Image Classification
TLDR
This work proposes the collage inference technique, which uses a novel convolutional neural network model, collage-cnn, to augment a collection of traditional single-image classifier models with a single collage-cnn classifier that acts as their low-cost redundant backup.
Lightweight Projective Derivative Codes for Compressed Asynchronous Gradient Descent
TLDR
This paper proposes a novel algorithm that encodes the partial derivatives themselves and further optimizes the codes by performing lossy compression on the derivative codewords, maximizing the information contained in each codeword while minimizing the information between the codewords.
Synergy via Redundancy: Adaptive Replication Strategies and Fundamental Limits
TLDR
This work seeks to find the fundamental limit of the throughput boost achieved by job replication and the optimal replication policy to achieve it, and proposes two myopic replication policies, MaxRate and AdaRep, to adaptively replicate jobs.
Rateless Sum-Recovery Codes For Distributed Non-Linear Computations
TLDR
This work addresses the problem of slowdown caused by straggling nodes in distributed non-linear computations and proposes a new class of rateless codes, called rateless sum-recovery codes, whose aim is to recover the sum of source symbols without necessarily recovering individual symbols.

References

Showing 1-10 of 86 references
Learning a Code: Machine Learning for Approximate Non-Linear Coded Computation
TLDR
This work proposes the first learning-based approach for designing codes, and presents the first coding-theoretic solution that can provide resilience for any non-linear (differentiable) computation.
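A minimal PyTorch sketch of what a learning-based code can look like, under my own toy setup rather than the paper's architecture: an encoder maps k inputs to one parity input, and a decoder reconstructs a missing output of a fixed base computation F from the available outputs plus the parity's output.

import torch
import torch.nn as nn

k, d = 2, 8
F = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))   # fixed base computation
for p in F.parameters():
    p.requires_grad_(False)

encoder = nn.Linear(k * d, d)   # learned encoder over the k inputs
decoder = nn.Linear(2 * d, d)   # learned decoder: (available output, F(parity)) -> missing output
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for step in range(2000):
    x = torch.randn(64, k, d)
    y = F(x)                                               # true outputs for all k inputs
    y_parity = F(encoder(x.reshape(64, -1)))               # output of F on the encoded parity input
    est = decoder(torch.cat([y[:, 0], y_parity], dim=1))   # pretend y[:, 1] is unavailable
    loss = ((est - y[:, 1]) ** 2).mean()
    opt.zero_grad(); loss.backward(); opt.step()

print("reconstruction MSE on the last batch:", loss.item())

Because F is non-linear, exact recovery is impossible in general; the encoder/decoder pair is trained so that the approximation error stays small.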
InferLine: ML Inference Pipeline Composition Framework
TLDR
InferLine is a system which efficiently provisions and executes ML inference pipelines subject to end-to-end latency constraints by proactively optimizing and reactively controlling per-model configuration in a fine-grained fashion.
Clipper: A Low-Latency Online Prediction Serving System
TLDR
Clipper is introduced, a general-purpose low-latency prediction serving system that introduces a modular architecture to simplify model deployment across frameworks and applications and improves prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks.
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
TLDR
PRETZEL is a prediction serving system introducing a novel white box architecture enabling both end-to-end and multi-model optimizations and is on average able to reduce 99th percentile latency while reducing memory footprint, and increasing throughput.
InferLine: Prediction Pipeline Provisioning and Management for Tight Latency Objectives
TLDR
InferLine is introduced, a system which provisions and executes ML prediction pipelines subject to end-to-end latency constraints by proactively optimizing and reactively controlling per-model configurations in a fine-grained fashion and generalizes across state-of-the-art model serving frameworks.
Rateless Codes for Near-Perfect Load Balancing in Distributed Matrix-Vector Multiplication
TLDR
This paper proposes a rateless fountain coding strategy that achieves the best of both worlds -- it is proved that its latency is asymptotically equal to that of ideal load balancing, and it performs asymptotically zero redundant computations.
Collage Inference: Tolerating Stragglers in Distributed Neural Network Inference using Coding
TLDR
This paper proposes modified single shot object detection models, Collage-CNN models, to provide necessary resilience efficiently in distributed image classification and demonstrates that the 99th percentile latency can be reduced by 1.46X compared to replication based approaches and without compromising prediction accuracy.
Speeding Up Distributed Machine Learning Using Codes
TLDR
This paper focuses on two of the most basic building blocks of distributed learning algorithms: matrix multiplication and data shuffling, and uses codes to reduce communication bottlenecks, exploiting the excess in storage.
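A hedged numpy sketch of MDS-coded matrix multiplication in the spirit of this summary: A is split into k row blocks, n coded blocks are formed with a Vandermonde-style generator, and A @ x is recovered from the results of any k of the n workers. Block sizes and the generator choice are mine.

import numpy as np

rng = np.random.default_rng(4)
k, n, d = 3, 5, 4
A = rng.normal(size=(6, d))                        # 6 rows, split into k = 3 blocks of 2
x = rng.normal(size=d)

blocks = np.split(A, k)
G = np.vander(np.arange(1, n + 1), k, increasing=True).astype(float)   # n x k MDS-style generator
coded_blocks = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]

done = [0, 3, 4]                                   # any k workers that finish first
results = np.stack([coded_blocks[i] @ x for i in done])
decoded = np.linalg.solve(G[done], results)        # solve for the k block products
print(np.allclose(decoded.reshape(-1), A @ x))     # True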
Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics
TLDR
This work presents Ernest, a performance prediction framework for large-scale analytics; evaluation on Amazon EC2 using several workloads shows that the prediction error is low while the training overhead is less than 5% for long-running jobs.
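A hedged sketch of the kind of parametric performance model such a framework fits from a handful of cheap profiling runs; the specific features of data scale s and machine count m below are my assumption, fit here with non-negative least squares.

import numpy as np
from scipy.optimize import nnls

def features(s, m):
    # Candidate terms: fixed cost, per-machine share of the data, and
    # communication terms that grow with the number of machines (assumed set).
    return np.array([1.0, s / m, np.log(m), float(m)])

# (data scale, machines, measured time) from a few small profiling runs.
runs = [(0.1, 2, 12.0), (0.1, 4, 7.5), (0.2, 4, 13.0), (0.2, 8, 8.0), (0.4, 8, 14.5)]
X = np.array([features(s, m) for s, m, _ in runs])
y = np.array([t for _, _, t in runs])

theta, _ = nnls(X, y)                              # non-negative least-squares fit
print("predicted time at full scale on 16 machines:", features(1.0, 16) @ theta)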
DeepCPU: Serving RNN-based Deep Learning Models 10x Faster
TLDR
This work characterizes RNN performance and identifies low data reuse as a root cause, and develops novel techniques and an efficient search strategy to squeeze more data reuse out of this intrinsically challenging workload.