Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance Benchmark

@article{dawnbench-analysis,
  title={Analysis of DAWNBench, a Time-to-Accuracy Machine Learning Performance Benchmark},
  author={Cody A. Coleman and Daniel Kang and Deepak Narayanan and Luigi Nardi and Tian Zhao and Jian Zhang and Peter D. Bailis and Kunle Olukotun and Christopher R{\'e} and Matei A. Zaharia},
  journal={ACM SIGOPS Operating Systems Review},
  pages={14--25}
}
Researchers have proposed hardware, software, and algorithmic optimizations to improve the computational performance of deep learning. While some of these optimizations perform the same operations faster (e.g., increasing GPU clock speed), many others modify the semantics of the training procedure (e.g., reduced precision) and can impact the final model's accuracy on unseen data. Due to a lack of standard evaluation criteria that consider these trade-offs, it is difficult to directly compare…
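The metric at the heart of DAWNBench can be made concrete with a small sketch: time-to-accuracy is the wall-clock time at which a training run first reaches a fixed validation-accuracy threshold. The function and run data below are illustrative, not from the paper.

```python
# Sketch of the time-to-accuracy (TTA) metric: the elapsed wall-clock time
# at which validation accuracy first reaches a fixed threshold.

def time_to_accuracy(epochs, threshold):
    """Return elapsed seconds until validation accuracy first reaches
    `threshold`, or None if it never does.

    `epochs` is a list of (elapsed_seconds, val_accuracy) tuples in
    training order.
    """
    for elapsed, acc in epochs:
        if acc >= threshold:
            return elapsed
    return None

# Two hypothetical training runs: a run that is slower per epoch can still
# win on TTA if it crosses the accuracy threshold sooner.
run_a = [(100, 0.80), (200, 0.91), (300, 0.94)]
run_b = [(60, 0.70), (120, 0.85), (180, 0.93), (240, 0.95)]

print(time_to_accuracy(run_a, 0.93))  # 300
print(time_to_accuracy(run_b, 0.93))  # 180
```

This is precisely the trade-off the abstract describes: an optimization that speeds up each step but hurts accuracy can still lose on time-to-accuracy, because the threshold is reached later or not at all.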


Jointly Optimizing Preprocessing and Inference for DNN-based Visual Analytics

This work examines end-to-end DNN execution in visual analytics systems on modern accelerators, and introduces novel methods of achieving accuracy and throughput trade-offs by using natively present, low-resolution visual data.

Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training

Zeus, an optimization framework for navigating the trade-off between energy consumption and training performance, is proposed; it uses an online exploration-exploitation approach in conjunction with just-in-time energy profiling, averting the need for expensive offline measurements.

DLBricks: Composable Benchmark Generation to Reduce Deep Learning Benchmarking Effort on CPUs

DLBricks is proposed, a composable benchmark generation design that reduces the effort of developing, maintaining, and running DL benchmarks; it decomposes DL models into a set of unique runnable networks and reconstructs the original model's performance from the performance of the generated benchmarks.

Benanza: Automatic μBenchmark Generation to Compute "Lower-bound" Latency and Inform Optimizations of Deep Learning Models on GPUs

Benanza is proposed, a sustainable and extensible benchmarking and analysis design that speeds up the characterization/optimization cycle of DL models on GPUs and identified optimizations in parallel layer execution, cuDNN convolution algorithm selection, framework inefficiency, layer fusion, and using Tensor Cores.

Communication Patterns in Distributed Deep Learning

This project trains Deep Neural Network models of various sizes on sixteen GPUs in Google Cloud Compute Engine platform and records information about the data the workers exchange as well as the timing of each iteration of training to study the communication component of training.

Evaluation of Optimized CNNs on Heterogeneous Accelerators Using a Novel Benchmarking Approach

It is shown that channel pruning is most effective and works across most hardware platforms, with speedups directly correlated to the reduction in compute load, while FPGAs benefit the most from quantization.

The Design and Implementation of a Scalable Deep Learning Benchmarking Platform

MLModelScope introduces a specification to define DL model evaluations and provides a runtime to provision the evaluation workflow using the user-specified HW/SW stack and implements MLModelScope as an open-source project with support for all major frameworks and hardware architectures.

Fast Training of Deep Learning Models over Multiple GPUs

This paper proposes FastT, a transparent module that works with the TensorFlow framework to automatically identify a satisfying deployment and execution order of operations in DNN models over multiple GPUs.

Multi-Precision Policy Enforced Training (MuPPET): A precision-switching strategy for quantised fixed-point training of CNNs

The novel training strategy, MuPPET, combines multiple number-representation regimes with a precision-switching mechanism that decides at run time when to transition between regimes, yielding improvements in training time and energy efficiency compared to state-of-the-art approaches.
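The precision-switching idea can be sketched in a few lines. This is a hedged illustration, not MuPPET's actual criterion (the paper derives its switching decision from gradient statistics); here a simple loss-stall heuristic stands in, and the regime list and thresholds are invented for the example.

```python
# Illustrative sketch of precision switching: train in a low-precision
# fixed-point regime and move to the next-higher bit width once the loss
# stops improving. The stall heuristic here is a stand-in, not the
# gradient-based criterion MuPPET actually uses.

REGIMES = [8, 12, 14, 16]  # hypothetical fixed-point bit widths, lowest first

def next_precision(current_bits, recent_losses, patience=3, tol=1e-3):
    """Return the bit width to train with next.

    Switches up one regime when each of the last `patience` steps improved
    the loss by less than `tol`; otherwise stays in the current regime.
    """
    if len(recent_losses) <= patience:
        return current_bits
    window = recent_losses[-(patience + 1):]
    improvements = [a - b for a, b in zip(window, window[1:])]
    stalled = all(imp < tol for imp in improvements)
    if stalled and current_bits in REGIMES[:-1]:
        return REGIMES[REGIMES.index(current_bits) + 1]
    return current_bits

print(next_precision(8, [1.0, 0.9, 0.8995, 0.8991, 0.8988]))  # 12 (stalled)
print(next_precision(8, [1.0, 0.9, 0.8, 0.7, 0.6]))           # 8  (improving)
```

The design point this captures is that early training tolerates coarse quantization, so cheap low-precision arithmetic is used until it stops paying off, and the schedule is decided online rather than fixed in advance.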

Bosch Deep Learning Hardware Benchmark

This work presents a new granularity level for evaluating common submodules of DL models, a twofold benchmark procedure that accounts for hardware and model optimizations done by HWA manufacturers, and an extended set of performance indicators that can help identify a mismatch between an HWA and the DL models used in this benchmark.



DAWNBench : An End-to-End Deep Learning Benchmark and Competition

DAWNBench is introduced, a benchmark and competition focused on end-to-end training time to achieve a state-of-the-art accuracy level, as well as inference with that accuracy, and will provide a useful, reproducible means of evaluating the many tradeoffs in deep learning systems.

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

TVM is a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends and automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations.

Benchmarking State-of-the-Art Deep Learning Software Tools

This paper presents an attempt to benchmark several state-of-the-art GPU-accelerated deep learning software tools, including Caffe, CNTK, TensorFlow, and Torch, and focuses on evaluating the running time performance of these tools with three popular types of neural networks on two representative CPU platforms and three representative GPU platforms.

Understanding and optimizing asynchronous low-precision stochastic gradient descent

The DMGC model is introduced, the first conceptualization of the parameter space that exists when implementing low-precision SGD, and it is shown that it provides a way to both classify these algorithms and model their performance.

Comparative Study of Deep Learning Software Frameworks

A comparative study of five deep learning frameworks, namely Caffe, Neon, TensorFlow, Theano, and Torch, on three aspects: extensibility, hardware utilization, and speed finds that Theano and Torch are the most easily extensible frameworks.

In-datacenter performance analysis of a tensor processing unit

  • N. Jouppi, C. Young, D. Yoon
  • Computer Science
    2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA)
  • 2017
This paper evaluates a custom ASIC, called a Tensor Processing Unit (TPU), deployed in datacenters since 2015 that accelerates the inference phase of neural networks (NN), and compares it to a server-class Intel Haswell CPU and an Nvidia K80 GPU, which are contemporaries deployed in the same datacenters.

Beyond Data and Model Parallelism for Deep Neural Networks

SOAP, a more comprehensive search space of parallelization strategies for DNNs that includes strategies to parallelize a DNN in the Sample, Operation, Attribute, and Parameter dimensions, is defined, and FlexFlow, a deep learning framework that uses guided randomized search of the SOAP space to find a fast parallelization strategy for a specific parallel machine, is proposed.

TBD: Benchmarking and Analyzing Deep Neural Network Training

A new benchmark for DNN training, called TBD, is proposed that uses a representative set of DNN models covering a wide range of machine learning applications, together with a new toolchain for performance analysis of these models that combines targeted usage of existing performance analysis tools, careful selection of new and existing metrics and methodologies to analyze the results, and utilization of domain-specific characteristics of DNN training.

Fathom: reference workloads for modern deep learning methods

This paper assembles Fathom: a collection of eight archetypal deep learning workloads, ranging from the familiar deep convolutional neural network of Krizhevsky et al., to the more exotic memory networks from Facebook's AI research group, and focuses on understanding the fundamental performance characteristics of each model.

High-Accuracy Low-Precision Training

A simple low-precision stochastic gradient descent variant called HALP is described, which uses SVRG to reduce gradient variance and combines it with a novel technique called bit centering to reduce quantization error.
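The bit-centering idea admits a short numeric sketch: rather than quantizing a weight directly over its full range, quantize its offset from a full-precision center point that is refreshed periodically. As the iterate converges the offset shrinks, so the same bit budget covers a smaller range and the quantization error shrinks with it. The scales, bit width, and values below are invented for illustration.

```python
# Illustrative sketch of bit centering: quantize the offset from a
# full-precision center rather than the weight itself, so a fixed bit
# budget spans a much smaller range late in training.

def quantize(x, scale, bits=8):
    """Round x to the nearest representable signed fixed-point value."""
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = max(lo, min(hi, round(x / scale)))
    return q * scale

center = 1.0   # full-precision center, refreshed at each outer epoch
w = 1.03       # current iterate, close to the center late in training

# Direct low-precision storage must cover the whole weight range...
coarse = quantize(w, scale=2.0 / 127)
# ...while bit-centered storage only covers the small offset w - center.
fine = center + quantize(w - center, scale=0.1 / 127)

print(abs(coarse - w) > abs(fine - w))  # True: centering cuts the error
```

This pairs naturally with SVRG, whose outer epochs already compute a full-precision reference point that can serve as the center.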