Benchmarking Resource Usage for Efficient Distributed Deep Learning

Nathan C Frey, Baolin Li, Joseph McDonald, Dan Zhao, Michael Jones, David Bestor, Devesh Tiwari, Vijay Gadepally, Siddharth Samsi
2022 IEEE High Performance Extreme Computing Conference (HPEC)

Deep learning (DL) workflows demand an ever-increasing budget of compute and energy in order to achieve outsized gains. As such, it becomes essential to understand how different deep neural networks (DNNs) and training leverage increasing compute and energy resources, especially specialized computationally-intensive models across different domains and applications. In this paper, we conduct over 3,400 experiments training an array of deep networks representing various domains/tasks: natural…

Great Power, Great Responsibility: Recommendations for Reducing Energy for Training Language Models

This article investigates techniques that can be used to reduce the energy consumption of common NLP applications and describes the impact of these settings on metrics such as computational performance and energy consumption through experiments conducted on a high performance computing system as well as popular cloud computing platforms.

A Green(er) World for A.I.

  • Dan Zhao, Nathan C Frey, S. Samsi
  • Computer Science
    2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
  • 2022
A bird's eye view of various areas for potential changes and improvements from the ground floor of AI's operational and hardware optimizations for datacenter/HPCs to the current incentive structures in the world of A.I. research and practice is presented.



Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters

This work performs a large-scale analysis of real-world job traces from SenseTime, reports observations about the characteristics of DL jobs and resource management, and introduces a general-purpose framework that manages resources based on historical data.

Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads

A detailed workload characterization of a two-month long trace from a multi-tenant GPU cluster in a large enterprise is presented and design guidelines pertaining to next-generation cluster schedulers for DNN training workloads are provided.

Paleo: A Performance Model for Deep Neural Networks

This work introduces an analytical performance model called PALEO, which can efficiently and accurately model the expected scalability and performance of a putative deep learning system and is robust to the choice of network architecture, hardware, software, communication schemes, and parallelization strategies.

Strategies to Deploy and Scale Deep Learning on the Summit Supercomputer

It is recommended that users take a step-wise tuning approach beginning with algorithmic kernel choice, node I/O configuration, and communications tuning as best-practice for scaling up DL model training campaigns.

Fathom: reference workloads for modern deep learning methods

This paper assembles Fathom: a collection of eight archetypal deep learning workloads, ranging from the familiar deep convolutional neural network of Krizhevsky et al., to the more exotic memory networks from Facebook's AI research group, and focuses on understanding the fundamental performance characteristics of each model.

Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications

Detailed characterizations of deep learning models used in many Facebook social network services are provided and the need for better co-design of algorithms, numerics and computing platforms to address the challenges of workloads often run in data centers is highlighted.

Scalable Geometric Deep Learning on Molecular Graphs

LitMatter is presented, a lightweight framework for scaling molecular deep learning methods that quantifies model-dependent scaling and enables optimal compute resource allocation and the identification of scalable molecular geometric deep learning model implementations.

EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks

A new scaling method is proposed that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient, and its effectiveness is demonstrated by scaling up MobileNets and ResNet.
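
As a rough illustration of the compound coefficient mentioned above, the sketch below scales depth, width, and input resolution together from a single value phi. The base constants (1.2, 1.1, 1.15) are the ones reported in the EfficientNet paper; the function name, its parameters, and the default base resolution of 224 are assumptions made for this example and do not come from any library.

```python
# Minimal sketch of compound scaling; names here are hypothetical.
# Bases for depth, width and resolution as reported in the EfficientNet paper.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15


def compound_scale(phi, base_depth=1.0, base_width=1.0, base_resolution=224):
    """Return (depth, width, resolution) multipliers scaled by one coefficient phi."""
    depth = base_depth * ALPHA ** phi
    width = base_width * BETA ** phi
    resolution = base_resolution * GAMMA ** phi
    return depth, width, resolution


# FLOPs grow with depth * width^2 * resolution^2, so each unit of phi
# multiplies the cost by roughly this factor (chosen to be close to 2).
flops_factor = ALPHA * BETA ** 2 * GAMMA ** 2

if __name__ == "__main__":
    for phi in range(4):
        print(phi, compound_scale(phi), round(flops_factor ** phi, 2))
```

The point of tying all three dimensions to one coefficient is that the compute budget can be dialed up in predictable steps instead of tuning depth, width, and resolution separately.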

Horovod: fast and easy distributed deep learning in TensorFlow

Horovod is an open source library that addresses two obstructions to scaling, communication overhead and the amount of code that must be changed: it employs efficient inter-GPU communication via ring reduction and requires only a few lines of modification to user code, enabling faster, easier distributed training in TensorFlow.
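
The ring reduction mentioned above can be pictured with a small single-process simulation. This is not Horovod's API or implementation, only a sketch of the generic ring allreduce pattern; the function name and the simplification of one number per chunk (with as many chunks as workers) are assumptions made for the example.

```python
# Single-process simulation of ring allreduce (sum); names are hypothetical.
def ring_allreduce(worker_chunks):
    """worker_chunks[i][c] is chunk c held by worker i.

    Assumes each worker holds exactly len(worker_chunks) chunks.
    Returns the per-worker data after every worker has the full sum.
    """
    n = len(worker_chunks)
    data = [list(chunks) for chunks in worker_chunks]

    # Phase 1, scatter-reduce: in each step worker i passes one chunk to
    # its right-hand neighbour, which adds it to its own copy.
    for step in range(n - 1):
        sends = [(i, (i - step) % n, data[i][(i - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] += value

    # Phase 2, all-gather: the fully reduced chunks travel once around the
    # ring, overwriting the partial sums on each neighbour.
    for step in range(n - 1):
        sends = [(i, (i + 1 - step) % n, data[i][(i + 1 - step) % n]) for i in range(n)]
        for src, chunk, value in sends:
            data[(src + 1) % n][chunk] = value

    return data


if __name__ == "__main__":
    print(ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]]))
```

Each worker only ever talks to its neighbour and sends one chunk per step, which is why the per-worker traffic stays roughly constant as more workers join the ring.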

Predicting the Computational Cost of Deep Learning Models

This work proposes an alternative approach in which a deep learning network is trained to predict the execution time for parts of a deep learning network, which has advantages over linear approaches as it can model more complex scenarios and support making a well-informed choice for the appropriate hardware and model.