Exploiting Parallelism Opportunities with Deep Learning Frameworks

  • Yu Emma Wang, Carole-Jean Wu, Xiaodong Wang, Kim M. Hazelwood, David M. Brooks
  • ACM Transactions on Architecture and Code Optimization (TACO), pp. 1–23

State-of-the-art machine learning frameworks support a wide variety of design features to enable a flexible programming interface and to ease the programmability burden on machine learning developers. Identifying and using a performance-optimal setting in such feature-rich frameworks, however, involves a non-trivial amount of performance profiling effort and often relies on domain-specific knowledge. This article takes a deep dive into analyzing the performance impact of key design…

Array languages make neural networks fast

This paper investigates a direct implementation of a state-of-the-art Convolutional Neural Network (CNN) in an array language; the resulting specification is written in a rank-polymorphic, data-parallel style and can be immediately leveraged by optimising compilers.

Automatic Tuning of Tensorflow's CPU Backend using Gradient-Free Optimization Algorithms

This paper treats the tuning of DL framework parameters to improve training and inference performance as a black-box optimization problem, and investigates the applicability and effectiveness of Bayesian optimization, a genetic algorithm, and the Nelder-Mead simplex method for tuning the parameters of TensorFlow's CPU backend.
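To make the black-box framing concrete, the following is a minimal sketch of one of the investigated approaches (a genetic algorithm) searching over TensorFlow's two CPU threading knobs. The cost function here is a synthetic stand-in, not a real measurement; in practice it would run a workload with the given `inter_op`/`intra_op` settings and return the measured step time.

```python
import random

# Hypothetical stand-in for the real objective: in practice this would run a
# TensorFlow workload with the given (inter_op, intra_op) thread settings and
# return its measured step time.
def measured_runtime(inter_op, intra_op):
    # Synthetic cost surface whose optimum sits at (2, 8).
    return (inter_op - 2) ** 2 + 0.5 * (intra_op - 8) ** 2 + 1.0

def tune_threads(generations=30, pop_size=8, seed=0):
    """Toy genetic algorithm over the two TensorFlow threading knobs."""
    rng = random.Random(seed)
    pop = [(rng.randint(1, 16), rng.randint(1, 16)) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda p: measured_runtime(*p))
        parents = pop[: pop_size // 2]              # elitism: keep fittest half
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = rng.sample(parents, 2)
            genes = [a[0], b[1]]                    # one-point crossover
            for i in range(2):                      # per-gene +/-1 mutation
                if rng.random() < 0.3:
                    genes[i] = min(16, max(1, genes[i] + rng.choice([-1, 1])))
            children.append(tuple(genes))
        pop = parents + children
    return min(pop, key=lambda p: measured_runtime(*p))

best = tune_threads()
```

Because each evaluation only requires running the workload and timing it, the same loop applies unchanged to any framework parameter exposed as an integer knob.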

GEVO-ML: a proposal for optimizing ML code with evolutionary computation

GEVO-ML is proposed, a tool for automatically discovering optimization opportunities and tuning the performance of ML kernels by focusing directly on ML frameworks, intermediate languages, and target architectures.

Nondeterministic Impact of CPU Multithreading on Training Deep Learning Systems

This paper presents the first study of the variance and robustness of DL systems under CPU multithreading during training, along with a VirtualBox-based experimental framework for analyzing that impact.

MLPerf™ HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems

  • S. Farrell, M. Emani, Junqi Yin
  • Computer Science
    2021 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC)
  • 2021
This paper introduces MLPerf HPC, a benchmark suite of large-scale scientific machine learning training applications driven by the MLCommons™ Association, and presents results from the first submission round, which included a diverse set of some of the world's largest HPC systems.

Optimizing Inference Performance of Transformers on CPUs

Focusing on the highly popular BERT model, this paper identifies key components of the Transformer architecture where the bulk of the computation happens, and proposes an Adaptive Linear Module Optimization (ALMO) to speed them up.

SMAUG: End-to-End Full-Stack Simulation Infrastructure for Deep Learning Workloads

SMAUG is presented, the first DNN framework that is purpose-built for simulation of end-to-end deep learning applications and offers researchers a wide range of capabilities for evaluating DNN workloads, from diverse network topologies to easy accelerator modeling and SoC integration.

RecShard: statistical feature-based memory optimization for industry-scale neural recommendation

RecShard determines an optimal sharding strategy for a set of embedding tables (EMBs) based on training data distributions and model characteristics, along with the bandwidth characteristics of the underlying tiered memory hierarchy, achieving over 6× higher EMB training throughput for capacity-constrained DLRMs.
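The core idea of statistics-driven placement can be illustrated with a greedy sketch (illustrative only, not RecShard's actual algorithm): rank tables by how many lookups they serve per MB of capacity, then fill the fast tier first. The table names and sizes below are invented for the example.

```python
# Illustrative sketch (not RecShard's actual algorithm): greedily place the
# most frequently accessed embedding tables into the fastest memory tier
# until its capacity is exhausted, spilling the rest to the slower tier.

def shard_tables(tables, hbm_capacity_mb):
    """tables: list of (name, size_mb, accesses_per_step) tuples."""
    placement = {}
    # Rank by access density: lookups served per MB of capacity consumed.
    ranked = sorted(tables, key=lambda t: t[2] / t[1], reverse=True)
    used = 0.0
    for name, size, _ in ranked:
        if used + size <= hbm_capacity_mb:
            placement[name] = "HBM"
            used += size
        else:
            placement[name] = "DRAM"
    return placement

tables = [
    ("user_id", 4096, 120_000),   # large, hot
    ("ad_id",   1024, 300_000),   # small, very hot
    ("geo",      256,   5_000),   # small, cold
    ("device",   128,  90_000),   # tiny, hot
]
plan = shard_tables(tables, hbm_capacity_mb=2048)
```

Here the very large `user_id` table spills to DRAM despite being hot, because its lookups-per-MB density is low relative to the smaller tables, which is exactly the kind of trade-off a tiered-memory sharder must resolve.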

AutoScale: Energy Efficiency Optimization for Stochastic Edge Inference Using Reinforcement Learning

This paper proposes AutoScale, an adaptive, lightweight execution-scaling engine built on a custom-designed reinforcement learning algorithm. It continuously learns and selects the most energy-efficient inference execution target by considering the characteristics of the neural networks and the available systems in a collaborative cloud-edge execution environment, while adapting to stochastic runtime variance.
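The selection problem AutoScale addresses can be caricatured as a multi-armed bandit: repeatedly pick an execution target, observe a noisy energy measurement, and converge on the cheapest arm. The epsilon-greedy learner and the energy numbers below are a hypothetical sketch of the idea, not AutoScale's actual algorithm.

```python
import random

# Hypothetical sketch of the idea behind AutoScale (not its actual algorithm):
# an epsilon-greedy bandit learning which execution target is most energy
# efficient under noisy, stochastic measurements.

TARGETS = ["cpu", "gpu", "cloud"]
TRUE_ENERGY_MJ = {"cpu": 90.0, "gpu": 40.0, "cloud": 60.0}  # unknown to agent

def measure_energy(target, rng):
    # Stochastic runtime variance: a noisy observation of the true cost.
    return TRUE_ENERGY_MJ[target] + rng.gauss(0, 5.0)

def run_bandit(steps=500, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    counts = {t: 0 for t in TARGETS}
    avg = {t: 0.0 for t in TARGETS}
    for _ in range(steps):
        if rng.random() < epsilon or 0 in counts.values():
            target = rng.choice(TARGETS)        # explore
        else:
            target = min(avg, key=avg.get)      # exploit lowest average energy
        e = measure_energy(target, rng)
        counts[target] += 1
        avg[target] += (e - avg[target]) / counts[target]  # running mean
    return min(avg, key=avg.get), avg

best, estimates = run_bandit()
```

The paper's actual formulation additionally conditions on neural-network and system characteristics, i.e. the state, which a plain bandit like this ignores.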

AutoFL: Enabling Heterogeneity-Aware Energy Efficient Federated Learning

AutoFL is proposed: a tailor-designed reinforcement learning algorithm that learns which K participant devices, and which per-device execution targets, to select for each FL model-aggregation round in the presence of stochastic runtime variance and system and data heterogeneity. It achieves 3.6× faster model convergence and 4.2× higher energy efficiency.

Benchmarking State-of-the-Art Deep Learning Software Tools

This paper presents an attempt to benchmark several state-of-the-art GPU-accelerated deep learning software tools, including Caffe, CNTK, TensorFlow, and Torch, and focuses on evaluating the running time performance of these tools with three popular types of neural networks on two representative CPU platforms and three representative GPU platforms.

Comparative Study of Caffe, Neon, Theano, and Torch for Deep Learning

A comparative study of four deep learning frameworks (Caffe, Neon, Theano, and Torch) on three aspects, namely extensibility, hardware utilization, and speed, finds that Theano and Torch are the most easily extensible frameworks.

TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

TVM is a compiler that exposes graph-level and operator-level optimizations to provide performance portability for deep learning workloads across diverse hardware back-ends. It automates the optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost-modeling method for rapid exploration of code optimizations.

Auto-Tuning TensorFlow Threading Model for CPU Backend

  • N. Hasabnis
  • Computer Science
    2018 IEEE/ACM Machine Learning in HPC Environments (MLHPC)
  • 2018
This paper develops an automatic approach, called TensorTuner, to search for optimal parameter settings of TensorFlow's threading model for CPU backends, and evaluates it on both the Eigen and Intel MKL CPU backends using a set of neural networks from TensorFlow's benchmarking suite.

Tensor Comprehensions: Framework-Agnostic High-Performance Machine Learning Abstractions

This work contributes Tensor Comprehensions, a language close to the mathematics of deep learning that offers both imperative and declarative styles; a polyhedral just-in-time compiler that converts a mathematical description of a deep learning DAG into a CUDA kernel with delegated memory management and synchronization; and a compilation cache populated by an autotuner.

Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters

An in-depth performance characterization of state-of-the-art DNNs such as ResNet(s) and Inception-v3/v4 on multiple CPU architectures including Intel Xeon Broadwell, three variants of the Intel Xeon Skylake, AMD EPYC, and NVIDIA GPUs like K80, P100, and V100 is provided.

Deep Learning Recommendation Model for Personalization and Recommendation Systems

A state-of-the-art deep learning recommendation model (DLRM) is developed, with implementations provided in both the PyTorch and Caffe2 frameworks. Its design includes a specialized parallelization scheme: model parallelism on the embedding tables to mitigate memory constraints, combined with data parallelism to scale out compute in the fully connected layers.
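The hybrid scheme can be sketched in a few lines (illustrative names, not the actual PyTorch implementation): embedding tables are partitioned across devices, while every device keeps a full replica of the dense MLP and processes its own slice of the batch.

```python
# Toy sketch of DLRM's hybrid parallelism: embedding tables are partitioned
# across devices (model parallelism) while the dense MLP layers are
# replicated on every device (data parallelism). Names are hypothetical.

def place_dlrm(num_tables, num_devices):
    placement = {"embeddings": {}, "mlp": []}
    for t in range(num_tables):
        # Round-robin sharding keeps per-device memory roughly balanced.
        placement["embeddings"][f"emb_{t}"] = f"device_{t % num_devices}"
    # Every device holds a full replica of the dense MLP.
    placement["mlp"] = [f"device_{d}" for d in range(num_devices)]
    return placement

plan = place_dlrm(num_tables=8, num_devices=4)
```

The consequence of this split is an all-to-all exchange between the two regimes: each device must gather the embedding lookups it needs from the devices that own those tables before running its MLP replica.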

Machine Learning Systems are Stuck in a Rut

This paper explains how the evolution of hardware accelerators favors compiler back ends that hyper-optimize large monolithic kernels, and shows how this reliance on high-performance but inflexible kernels reinforces the dominant style of programming model.

MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems

The API design and the system implementation of MXNet are described, and it is explained how embedding of both symbolic expression and tensor operation is handled in a unified fashion.

Caffe: Convolutional Architecture for Fast Feature Embedding

Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures.