# Parity Models: A General Framework for Coding-Based Resilience in ML Inference

@article{Kosaian2019ParityMA, title={Parity Models: A General Framework for Coding-Based Resilience in ML Inference}, author={Jack Kosaian and K. V. Rashmi and Shivaram Venkataraman}, journal={ArXiv}, year={2019}, volume={abs/1905.00863} }

Machine learning models are becoming the primary workhorses for many applications. Production services deploy models through prediction serving systems that take in queries and return predictions by performing inference on machine learning models. In order to scale to high query rates, prediction serving systems are run on many machines in cluster settings, and thus are prone to slowdowns and failures that inflate tail latency and cause violations of strict latency targets. Current approaches…

## 10 Citations

Rateless Codes for Distributed Non-linear Computations

- Computer Science
- 2021 11th International Symposium on Topics in Coding (ISTC)
- 2021

This work proposes a coded computing strategy for mitigating the effect of stragglers on non-linear distributed computations and shows that erasure codes can be used to generate and compute random linear combinations of functions at the nodes such that the original function can be computed as long as a subset of nodes return their computations.

Rateless Codes for Near-Perfect Load Balancing in Distributed Matrix-Vector Multiplication

- Computer Science
- Proc. ACM Meas. Anal. Comput. Syst.
- 2019

This paper proposes a rateless fountain coding strategy that achieves the best of both worlds: its latency is proved to be asymptotically equal to that of ideal load balancing, and it performs asymptotically zero redundant computations.

Enabling Low-Redundancy Proactive Fault Tolerance for Stream Machine Learning via Erasure Coding

- Computer Science
- 2021 40th International Symposium on Reliable Distributed Systems (SRDS)
- 2021

This paper presents StreamLEC, a stream machine learning system that leverages erasure coding to provide low-redundancy proactive fault tolerance for immediate failure recovery. It achieves much higher throughput than both reactive fault tolerance and replication-based proactive fault tolerance, with negligible failure recovery overhead.

Slack squeeze coded computing for adaptive straggler mitigation

- Computer Science
- SC
- 2019

This paper proposes a dynamic workload distribution strategy for coded computation called Slack Squeeze Coded Computation (S2C2), which squeezes the compute slack (i.e., overhead) that is built into the coded computing frameworks by efficiently assigning work for all fast and slow nodes according to their speeds and without needing to re-distribute data.

Collage Inference: Using Coded Redundancy for Lowering Latency Variation in Distributed Image Classification Systems

- Computer Science
- 2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS)
- 2020

This work proposes the collage inference technique, which uses a novel convolutional neural network model, collage-cnn, to provide low-cost redundancy and demonstrates that the 99th percentile tail latency of inference can be reduced by 1.2x to 2x compared to replication-based approaches while providing high accuracy.

A demonstration of Willump

- Computer Science
- Proc. VLDB Endow.
- 2020

This demo presents Willump, an optimizer for ML inference that introduces statistically-motivated optimizations targeting ML applications whose performance bottleneck is feature computation.

Collage Inference: Using Coded Redundancy for Low Variance Distributed Image Classification

- Computer Science
- 2019

This work proposes the collage inference technique, which uses a novel convolutional neural network model, collage-cnn, to provide low-cost redundancy: a collection of traditional single-image classifier models is augmented with a single collage-cnn classifier that acts as their low-cost redundant backup.

Lightweight Projective Derivative Codes for Compressed Asynchronous Gradient Descent

- Computer Science
- ICML
- 2022

This paper proposes a novel algorithm that encodes the partial derivatives themselves and further optimizes the codes by performing lossy compression on the derivative codewords, maximizing the information contained in each codeword while minimizing the information between the codewords.

Synergy via Redundancy: Adaptive Replication Strategies and Fundamental Limits

- Business, Computer Science
- IEEE/ACM Transactions on Networking
- 2021

This work seeks to find the fundamental limit of the throughput boost achieved by job replication and the optimal replication policy to achieve it, and proposes two myopic replication policies, MaxRate and AdaRep, to adaptively replicate jobs.

Rateless Sum-Recovery Codes For Distributed Non-Linear Computations

- Computer Science
- 2022

This work addresses the slowdown caused by straggling nodes in distributed non-linear computations and proposes a new class of rateless codes, called rateless sum-recovery codes, whose aim is to recover the sum of the source symbols without necessarily recovering the individual symbols.

## References

SHOWING 1-10 OF 86 REFERENCES

Learning a Code: Machine Learning for Approximate Non-Linear Coded Computation

- Computer Science
- ArXiv
- 2018

This work proposes the first learning-based approach for designing codes, and presents the first coding-theoretic solution that can provide resilience for any non-linear (differentiable) computation.

InferLine: ML Inference Pipeline Composition Framework

- Computer Science
- ArXiv
- 2018

InferLine is a system which efficiently provisions and executes ML inference pipelines subject to end-to-end latency constraints by proactively optimizing and reactively controlling per-model configuration in a fine-grained fashion.

Clipper: A Low-Latency Online Prediction Serving System

- Computer Science
- NSDI
- 2017

This paper introduces Clipper, a general-purpose low-latency prediction serving system whose modular architecture simplifies model deployment across frameworks and applications and improves prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks.

PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems

- Computer Science
- OSDI
- 2018

PRETZEL is a prediction serving system with a novel white-box architecture that enables both end-to-end and multi-model optimizations. On average, it reduces 99th percentile latency while also reducing memory footprint and increasing throughput.

InferLine: Prediction Pipeline Provisioning and Management for Tight Latency Objectives

- Computer Science
- 2019

This paper introduces InferLine, a system that provisions and executes ML prediction pipelines subject to end-to-end latency constraints by proactively optimizing and reactively controlling per-model configurations in a fine-grained fashion, and that generalizes across state-of-the-art model serving frameworks.

Rateless Codes for Near-Perfect Load Balancing in Distributed Matrix-Vector Multiplication

- Computer Science
- Proc. ACM Meas. Anal. Comput. Syst.
- 2019

This paper proposes a rateless fountain coding strategy that achieves the best of both worlds: its latency is proved to be asymptotically equal to that of ideal load balancing, and it performs asymptotically zero redundant computations.

Collage Inference: Tolerating Stragglers in Distributed Neural Network Inference using Coding

- Computer Science
- ArXiv
- 2019

This paper proposes modified single-shot object detection models, Collage-CNN models, to provide the necessary resilience efficiently in distributed image classification, and demonstrates that the 99th percentile latency can be reduced by 1.46x compared to replication-based approaches without compromising prediction accuracy.

Speeding Up Distributed Machine Learning Using Codes

- Computer Science
- IEEE Transactions on Information Theory
- 2018

This paper focuses on two of the most basic building blocks of distributed learning algorithms: matrix multiplication and data shuffling, and uses codes to reduce communication bottlenecks, exploiting the excess in storage.

Ernest: Efficient Performance Prediction for Large-Scale Advanced Analytics

- Computer Science
- NSDI
- 2016

This paper presents Ernest, a performance prediction framework for large-scale analytics. Evaluation on Amazon EC2 using several workloads shows that the prediction error is low, with a training overhead of less than 5% for long-running jobs.

DeepCPU: Serving RNN-based Deep Learning Models 10x Faster

- Computer Science
- USENIX Annual Technical Conference
- 2018

This work characterizes RNN performance and identifies low data reuse as a root cause, and develops novel techniques and an efficient search strategy to squeeze more data reuse out of this intrinsically challenging workload.