Corpus ID: 231639254

Accelerating Deep Learning Inference via Learned Caches

Arjun Balasubramanian, Adarsh Kumar, Yuhan Liu, Han K. Cao, Shivaram Venkataraman, Aditya Akella
Deep Neural Networks (DNNs) are witnessing increased adoption in multiple domains owing to their high accuracy in solving real-world problems. However, this high accuracy has been achieved by building deeper networks, posing a fundamental challenge to the low-latency inference desired by user-facing applications. Current low-latency solutions either trade off accuracy or fail to exploit the inherent temporal locality in prediction-serving workloads. We observe that caching hidden layer outputs of…
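The core idea sketched in the abstract (reuse a cached result keyed on an intermediate activation, skipping the remaining layers) can be illustrated with a toy early-exit loop. The layer split, cache keying, and exact-match policy below are illustrative assumptions, not the paper's actual learned-cache design:

```python
def make_layer(weight):
    """A toy 'layer': scales its input by a fixed weight."""
    return lambda x: x * weight

EARLY_LAYERS = [make_layer(2), make_layer(3)]   # cheap prefix of the DNN
LATE_LAYERS = [make_layer(5), make_layer(7)]    # expensive suffix

cache = {}  # maps an intermediate activation to the final output

def infer(x):
    """Run the prefix; on a cache hit, skip the expensive suffix."""
    h = x
    for layer in EARLY_LAYERS:
        h = layer(h)
    if h in cache:               # temporal locality: activation seen before
        return cache[h], True    # early exit; the late layers never run
    y = h
    for layer in LATE_LAYERS:
        y = layer(y)
    cache[h] = y                 # remember the suffix's result
    return y, False
```

A real system would match activations approximately (e.g., with a learned lookup) rather than by exact dictionary equality, since floating-point activations rarely repeat bit-for-bit.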
Efficient DNN Training with Knowledge-Guided Layer Freezing
KGT, a knowledge-guided DNN training system, is designed: it employs semantic knowledge from a reference model to accurately evaluate each layer's training plasticity and safely freeze converged layers, saving their backward computation and communication.
Pythia: A Customizable Hardware Prefetching Framework Using Online Reinforcement Learning
A case is made for a holistic prefetch algorithm that learns to prefetch using multiple types of program context and system-level feedback inherent to its design; Pythia is proposed, which formulates the prefetcher as a reinforcement learning agent.
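The prefetcher-as-RL-agent formulation described above can be sketched with a tabular agent: state is the last observed stride (a stand-in for program context), actions are candidate prefetch strides, and the reward is system feedback (was the prefetched address actually touched?). Pythia itself uses richer feature combinations and hardware-friendly tables; everything here is a simplified assumption:

```python
import random

random.seed(0)
ACTIONS = [1, 2, 4]          # candidate prefetch strides
q = {}                       # Q[(state, action)] -> estimated value
ALPHA, EPSILON = 0.5, 0.1    # learning rate, exploration rate

def choose(state):
    """Epsilon-greedy action selection over the Q-table."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q.get((state, a), 0.0))

def update(state, action, reward):
    """One-step Q update toward the observed reward."""
    old = q.get((state, action), 0.0)
    q[(state, action)] = old + ALPHA * (reward - old)

# Train on a synthetic stride-2 access stream: prefetching addr+2 is useful.
addr, last_stride = 0, 2
for _ in range(500):
    a = choose(last_stride)
    nxt = addr + 2                                  # the program touches addr+2
    update(last_stride, a, 1.0 if addr + a == nxt else -1.0)
    addr, last_stride = nxt, 2
```

After training, the agent's greedy choice for this context converges to the useful stride, mirroring how reward shaped by prefetch usefulness steers the policy.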
Enabling Deep Learning for All-in EDGE paradigm
The key performance metrics for deep learning in the all-in-EDGE paradigm are presented, in order to evaluate various deep learning techniques and choose a suitable design that copes with the high computation, latency, and bandwidth demands of real-world deep learning applications.
Serving DNN Models with Multi-Instance GPUs: A Case of the Reconfigurable Machine Scheduling Problem
MIG-serving is an algorithm pipeline that blends newly designed algorithms with customized classic ones, including a heuristic greedy algorithm, a Genetic Algorithm (GA), and Monte Carlo Tree Search (MCTS), and is implemented on Kubernetes.
A Survey of Machine Learning-Based System Performance Optimization Techniques
This survey provides a detailed design summary of machine-learning-based system performance optimization approaches, describing the model, input, output, and prediction method of each approach built on well-known models such as the perceptron, LSTM, and RNN.


Accelerating Deep Learning Inference via Freezing
It is observed that caching intermediate-layer outputs can avoid running all the layers of a DNN for a sizeable fraction of inference requests; a system is presented that introduces approximate caching at each intermediate layer, along with techniques to reduce the cache size and improve the cache hit rate.
Applying Deep Learning to the Cache Replacement Problem
This paper shows that, for cache replacement, a powerful LSTM learning model can provide better accuracy in an offline setting than current hardware predictors, and designs a simple online model that matches the offline model's accuracy at orders-of-magnitude lower cost.
DeepCPU: Serving RNN-based Deep Learning Models 10x Faster
This work characterizes RNN performance and identifies low data reuse as a root cause, and develops novel techniques and an efficient search strategy to squeeze more data reuse out of this intrinsically challenging workload.
Clipper: A Low-Latency Online Prediction Serving System
Clipper is introduced, a general-purpose low-latency prediction serving system with a modular architecture that simplifies model deployment across frameworks and applications, improving prediction throughput, accuracy, and robustness without modifying the underlying machine learning frameworks.
Low latency RNN inference with cellular batching
The technique of cellular batching is proposed, which improves both the latency and throughput of RNN inference, achieving much lower latency and higher throughput than existing systems.
TVM: An Automated End-to-End Optimizing Compiler for Deep Learning
TVM is a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends and automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations.
PRETZEL: Opening the Black Box of Machine Learning Prediction Serving Systems
PRETZEL is a prediction serving system introducing a novel white-box architecture that enables both end-to-end and multi-model optimizations; on average it reduces 99th-percentile latency while shrinking memory footprint and increasing throughput.
SkipNet: Learning Dynamic Routing in Convolutional Networks
This work introduces SkipNet, a modified residual network, that uses a gating network to selectively skip convolutional blocks based on the activations of the previous layer, and proposes a hybrid learning algorithm that combines supervised learning and reinforcement learning to address the challenges of non-differentiable skipping decisions.
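The gated-skipping mechanism summarized above can be sketched in a few lines: a gate inspects the previous activation and decides whether to execute the next block or pass the input through unchanged. The gate rule and the block below are stand-in functions, not SkipNet's learned gating and residual networks:

```python
def gate(h):
    """Hypothetical gate: run the block only when the activation is large."""
    return abs(h) >= 1.0        # True = execute block, False = skip

def block(h):
    """Stand-in residual block: h + f(h), with f(h) = 0.5 * h."""
    return h + 0.5 * h

def forward(x, n_blocks=4):
    """Run n_blocks gated blocks; count how many actually execute."""
    h, executed = x, 0
    for _ in range(n_blocks):
        if gate(h):
            h = block(h)
            executed += 1       # this block's compute was spent
        # else: identity skip, saving the block's compute entirely
    return h, executed
```

Because the hard skip decision is non-differentiable, SkipNet trains the real gates with the hybrid supervised/reinforcement-learning scheme the summary mentions.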
Parity models: erasure-coded resilience for prediction serving systems
ParM, a prediction serving system that uses parity models for erasure-coded resilience, reduces the gap between 99.9th-percentile and median latency by up to 3.5×, while maintaining the same median.
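The erasure-coded decoding idea behind parity models can be sketched as follows: k queries are summed into one parity query, a parity model approximates the sum of the k predictions, and a missing prediction is recovered by subtraction. A zero-bias linear model is used here so the algebra is exact; ParM instead trains a neural parity model to approximate this relation:

```python
def model(x):
    """Toy linear predictor; zero bias keeps the sum-encoding exact."""
    return 3 * x

def serve_with_parity(queries, failed_index):
    """Serve k queries; reconstruct the one lost to a failed replica."""
    parity_query = sum(queries)                # encoder: sum of the k inputs
    parity_pred = model(parity_query)          # served by a separate parity server
    preds = [model(q) if i != failed_index else None
             for i, q in enumerate(queries)]   # one replica is unavailable
    # Decoder: missing prediction = parity prediction - available predictions
    recovered = parity_pred - sum(p for p in preds if p is not None)
    preds[failed_index] = recovered
    return preds
```

For a nonlinear DNN the recovered prediction is approximate rather than exact, which is why the parity model must be trained for this encoding.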
Nexus: a GPU cluster engine for accelerating DNN-based video analysis
Nexus is a fully implemented system that includes cluster-scale resource management performing detailed scheduling of GPUs, reasoning about groups of DNN invocations that need to be co-scheduled, and moving from the conventional whole-DNN execution model to executing fragments of DNNs.