Benchmarking Resource Usage for Efficient Distributed Deep Learning
@article{Frey2022BenchmarkingRU, title={Benchmarking Resource Usage for Efficient Distributed Deep Learning}, author={Nathan C Frey and Baolin Li and Joseph McDonald and Dan Zhao and Michael Jones and David Bestor and Devesh Tiwari and Vijay Gadepally and Siddharth Samsi}, journal={2022 IEEE High Performance Extreme Computing Conference (HPEC)}, year={2022}, pages={1-8} }
Deep learning (DL) workflows demand an ever-increasing budget of compute and energy in order to achieve outsized gains. As such, it becomes essential to understand how different deep neural networks (DNNs) and their training leverage increasing compute and energy resources, especially specialized, computationally intensive models across different domains and applications. In this paper, we conduct over 3,400 experiments training an array of deep networks representing various domains/tasks: natural…
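For context on how such resource measurements are typically collected, the following is a minimal sketch (not the paper's released tooling) of sampling GPU power draw and utilization during a training run with the pynvml bindings; the device index and one-minute sampling window are assumptions.

```python
# Minimal sketch: log GPU power and utilization during training, then
# approximate energy by integrating power over the 1 s sampling intervals.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU (assumed device index)

samples = []
try:
    for _ in range(60):  # ~60 s of 1 Hz sampling (assumed window)
        power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # mW -> W
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu    # percent
        samples.append((time.time(), power_w, util))
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()

energy_j = sum(p for _, p, _ in samples) * 1.0  # 1 s per sample
print(f"mean power {sum(p for _, p, _ in samples) / len(samples):.1f} W, "
      f"energy ~{energy_j / 1000:.2f} kJ")
```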
2 Citations
Great Power, Great Responsibility: Recommendations for Reducing Energy for Training Language Models
- Computer Science, NAACL-HLT
- 2022
This article investigates techniques for reducing the energy consumption of common NLP applications and describes, through experiments on a high performance computing system as well as popular cloud computing platforms, the impact of these settings on metrics such as computational performance and energy consumption.
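One commonly studied knob in this line of work is the GPU power limit; below is a hedged sketch of querying and lowering it through NVML (comparable to `nvidia-smi -pl`). The 200 W target is an arbitrary example value, and changing the limit typically requires elevated privileges.

```python
# Hedged sketch of GPU power-capping via NVML; the 200 W cap is illustrative.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

default_mw = pynvml.nvmlDeviceGetPowerManagementDefaultLimit(handle)
min_mw, max_mw = pynvml.nvmlDeviceGetPowerManagementLimitConstraints(handle)
print(f"default {default_mw / 1000:.0f} W, "
      f"allowed {min_mw / 1000:.0f}-{max_mw / 1000:.0f} W")

# Cap the board at 200 W, clamped to the supported range.
target_mw = max(min_mw, min(200_000, max_mw))
pynvml.nvmlDeviceSetPowerManagementLimit(handle, target_mw)

pynvml.nvmlShutdown()
```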
A Green(er) World for A.I.
- Computer Science, 2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)
- 2022
This paper presents a bird's-eye view of areas for potential changes and improvements, from the ground floor of A.I.'s operational and hardware optimizations for datacenters/HPC up to the current incentive structures in the world of A.I. research and practice.
References
Showing 1-10 of 48 references
Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters
- Computer Science, SC21: International Conference for High Performance Computing, Networking, Storage and Analysis
- 2021
This work performs a large-scale analysis of real-world job traces from SenseTime to characterize DL jobs and resource management, and introduces a general-purpose framework that manages resources based on historical data.
Analysis of Large-Scale Multi-Tenant GPU Clusters for DNN Training Workloads
- Computer Science, USENIX Annual Technical Conference
- 2019
This paper presents a detailed workload characterization of a two-month-long trace from a multi-tenant GPU cluster in a large enterprise and provides design guidelines for next-generation cluster schedulers for DNN training workloads.
Paleo: A Performance Model for Deep Neural Networks
- Computer Science, ICLR
- 2017
This work introduces an analytical performance model called PALEO, which can efficiently and accurately model the expected scalability and performance of a putative deep learning system and is robust to the choice of network architecture, hardware, software, communication schemes, and parallelization strategies.
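As an illustration of the general idea behind analytical models of this kind (not Paleo's actual formulation), the toy estimate below combines a compute term with a ring-allreduce communication term; all constants are assumed, illustrative values.

```python
# Toy back-of-the-envelope estimate: per-step time ~ compute + gradient exchange.
def step_time_s(flops_per_step: float,
                params: float,
                n_gpus: int,
                gpu_flops: float = 15e12,     # sustained FLOP/s per GPU (assumed)
                efficiency: float = 0.5,      # fraction of peak actually achieved
                bandwidth_bps: float = 10e9,  # per-link bandwidth, bytes/s (assumed)
                bytes_per_param: int = 4) -> float:
    compute = flops_per_step / (n_gpus * gpu_flops * efficiency)
    if n_gpus == 1:
        return compute
    # Ring all-reduce moves ~2*(n-1)/n of the gradient bytes per worker.
    comm = 2 * (n_gpus - 1) / n_gpus * params * bytes_per_param / bandwidth_bps
    return compute + comm

for n in (1, 2, 4, 8, 16):
    print(n, round(step_time_s(flops_per_step=5e14, params=1e8, n_gpus=n), 3), "s")
```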
Strategies to Deploy and Scale Deep Learning on the Summit Supercomputer
- Computer Science, 2019 IEEE/ACM Third Workshop on Deep Learning on Supercomputers (DLS)
- 2019
The authors recommend a step-wise tuning approach, beginning with algorithmic kernel choice, then node I/O configuration, then communications tuning, as best practice for scaling up DL model training campaigns.
Fathom: reference workloads for modern deep learning methods
- Computer Science, 2016 IEEE International Symposium on Workload Characterization (IISWC)
- 2016
This paper assembles Fathom: a collection of eight archetypal deep learning workloads, ranging from the familiar deep convolutional neural network of Krizhevsky et al., to the more exotic memory networks from Facebook's AI research group, and focuses on understanding the fundamental performance characteristics of each model.
Deep Learning Inference in Facebook Data Centers: Characterization, Performance Optimizations and Hardware Implications
- Computer Science, ArXiv
- 2018
Detailed characterizations of deep learning models used in many Facebook social network services are provided and the need for better co-design of algorithms, numerics and computing platforms to address the challenges of workloads often run in data centers is highlighted.
Scalable Geometric Deep Learning on Molecular Graphs
- Computer Science, ArXiv
- 2021
LitMatter, a lightweight framework for scaling molecular deep learning methods, is presented; it quantifies model-dependent scaling, enables optimal compute resource allocation, and helps identify scalable implementations of molecular geometric deep learning models.
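A hedged sketch of the kind of scaling quantification such frameworks report: fitting a power law to epoch times measured at different GPU counts. The timing numbers below are placeholders, not results from LitMatter.

```python
# Fit epoch_time ~ C * n_gpus**(-alpha) in log-log space to estimate scaling.
import numpy as np

n_gpus = np.array([1, 2, 4, 8, 16], dtype=float)
epoch_time_s = np.array([1000.0, 540.0, 300.0, 180.0, 120.0])  # placeholder data

slope, log_c = np.polyfit(np.log(n_gpus), np.log(epoch_time_s), 1)
alpha = -slope
print(f"scaling exponent alpha ~ {alpha:.2f} (1.0 = ideal linear scaling)")
```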
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
- Computer Science, ICML
- 2019
A new scaling method is proposed that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient, and its effectiveness is demonstrated by scaling up MobileNets and ResNet.
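The compound scaling rule can be written down in a few lines; the sketch below uses the base coefficients reported for EfficientNet-B0 (alpha=1.2, beta=1.1, gamma=1.15) and is only a coefficient calculator, not the model itself.

```python
# EfficientNet compound scaling: depth, width, resolution grow as
# alpha**phi, beta**phi, gamma**phi with alpha * beta**2 * gamma**2 ~= 2.
alpha, beta, gamma = 1.2, 1.1, 1.15

def compound_scale(phi: int):
    return alpha ** phi, beta ** phi, gamma ** phi

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```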
Horovod: fast and easy distributed deep learning in TensorFlow
- Computer Science, ArXiv
- 2018
Horovod is an open source library that addresses both obstacles to scaling: it employs efficient inter-GPU communication via ring reduction and requires only a few lines of modification to user code, enabling faster, easier distributed training in TensorFlow.
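A condensed sketch of that few-lines-of-modification pattern, using the horovod.tensorflow.keras API; the model, data, and hyperparameters are placeholders, not from the paper.

```python
# Minimal Horovod/Keras pattern: init, pin one GPU per process, scale the
# learning rate, wrap the optimizer, and broadcast initial weights from rank 0.
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()

gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(32,)),
    tf.keras.layers.Dense(10),
])

opt = tf.keras.optimizers.SGD(0.01 * hvd.size())   # scale LR by worker count
opt = hvd.DistributedOptimizer(opt)                # average gradients via allreduce

model.compile(optimizer=opt,
              loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

callbacks = [hvd.callbacks.BroadcastGlobalVariablesCallback(0)]

# x_train / y_train are assumed to exist; launch with e.g. `horovodrun -np 4 ...`
# model.fit(x_train, y_train, epochs=5, callbacks=callbacks,
#           verbose=1 if hvd.rank() == 0 else 0)
```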
Predicting the Computational Cost of Deep Learning Models
- Computer Science, 2018 IEEE International Conference on Big Data (Big Data)
- 2018
This work proposes an alternative approach in which a deep learning network is trained to predict the execution time for parts of a deep learning network; this has advantages over linear approaches, as it can model more complex scenarios and supports making a well-informed choice of hardware and model.
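To make the idea concrete, here is a minimal, self-contained illustration (not the authors' model) of regressing per-layer execution time on simple layer features; the synthetic data, feature choice, and regressor are assumptions for demonstration only.

```python
# Learn a mapping from layer features (FLOPs, params, batch size) to latency.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic "profiled" layers; the latency formula is an arbitrary stand-in.
n = 500
flops = rng.uniform(1e7, 1e10, n)
params = rng.uniform(1e4, 1e8, n)
batch = rng.integers(1, 257, n).astype(float)
latency_ms = 1e-7 * flops / batch**0.1 + 1e-6 * params + rng.normal(0, 0.5, n)

X = np.column_stack([np.log10(flops), np.log10(params), np.log10(batch)])
X_tr, X_te, y_tr, y_te = train_test_split(X, latency_ms, random_state=0)

reg = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
reg.fit(X_tr, y_tr)
print("R^2 on held-out layers:", round(reg.score(X_te, y_te), 3))
```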