Quantifying and Improving Performance of Distributed Deep Learning with Cloud Storage

Nicholas Krichevsky, Matthew St Louis, Tian Guo
2021 IEEE International Conference on Cloud Engineering (IC2E)
Cloud computing provides a powerful yet low-cost environment for distributed deep learning workloads. However, training complex deep learning models often requires accessing large amounts of data, which can easily exceed the capacity of local disks. Prior research often overlooks this training data problem by implicitly assuming that data is available locally or via low-latency network-based data storage. Such implicit assumptions often do not hold in a cloud-based training environment, where…

Exploiting CXL-based Memory for Distributed Deep Learning

A framework is proposed, called DeepMemoryDL, that manages the allocation of additional CXL-based memory, introduces a fast intermediate storage tier, and provides intelligent prefetching and caching mechanisms for DL workloads to reduce the overall training time while enabling DL jobs to efficiently train models using data that is much larger than the installed system memory.

Enabling Deep Learning for All-in EDGE paradigm

The key performance metrics for Deep Learning at the All-in EDGE paradigm are presented to evaluate various deep learning techniques and choose a suitable design to overcome difficulties due to other requirements such as high computation, high latency, and high bandwidth caused by Deep Learning applications in real-world scenarios.

Enabling All In-Edge Deep Learning: A Literature Review

This survey paper focuses primarily on the fifth level of EI, called all in-edge level, where DL training and inference (deployment) are performed solely by edge servers, which is suitable when the end devices have low computing resources, e.g., Internet-of-Things.

Data science and Machine learning in the Clouds: A Perspective for the Future

The rise of paradigms like approximate computing, quantum computing and many more in recent times and their applicability in big data processing, data science, analytics, prediction and machine learning in the cloud environments are discussed.

Deployment of Image to Text Web Translator using Deep Learning on Cloud

Deep Learning with the integration of other technologies becomes capable of being that kind of translator by translating the gesture language to words and vice versa.



Characterizing and Modeling Distributed Training with Transient Cloud GPU Servers

  • Shijian Li, R. Walls, Tian Guo
  • Computer Science
    2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS)
  • 2020
This work analyzes distributed training performance under diverse cluster configurations using CM-DARE, a cloud-based measurement and training framework and demonstrates the feasibility of predicting training speed and overhead using regression-based models.

Hoard: A Distributed Data Caching System to Accelerate Deep Learning Training on the Cloud

Hoard, using two NVMe disks per node and a distributed file system for caching, achieves a 2.1x speed-up over a 10 Gb/s NFS central storage system on a 16-GPU (4 nodes, 4 GPUs per node) cluster for a challenging AlexNet ImageNet image classification benchmark.

Performance Modeling and Scalability Optimization of Distributed Deep Learning Systems

Performance models that quantify the impact of partitioning and provisioning decisions on overall distributed system performance and scalability and a scalability optimizer that efficiently determines the optimal system configuration that minimizes DNN training time are developed.

DDLBench: Towards a Scalable Benchmarking Infrastructure for Distributed Deep Learning

This work presents an in-depth comparison of three parallelism models that address the needs of distributed deep learning, in terms of both computation time and memory usage, and introduces DDLBench, a comprehensive benchmark suite to quantify these differences in practice.

JPAS: Job-progress-aware flow scheduling for deep learning clusters

Distributed Machine Learning with a Serverless Architecture

This paper proposes SIREN, an asynchronous distributed machine learning framework based on the emerging serverless architecture, with which stateless functions can be executed in the cloud without the complexity of building and maintaining virtual machine infrastructures.

A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning

Deep500 is the first customizable benchmarking infrastructure that enables fair comparison of the plethora of deep learning frameworks, algorithms, libraries, and techniques and provides software infrastructure to utilize the most powerful supercomputers for extreme-scale workloads.

Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

  • S. Shi, Xiaowen Chu
  • Computer Science
    2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech)
  • 2018
This study evaluates the running performance of four state-of-the-art distributed deep learning frameworks (i.e., Caffe-MPI, CNTK, MXNet, and TensorFlow) over single-GPU, multi-GPU, and multi-node environments and identifies bottlenecks and overheads which could be further optimized.

Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning

This paper presents a distributed paradigm on the parameter server framework called Dynamic Stale Synchronous Parallel (DSSP), which improves the state-of-the-art SSP paradigm by dynamically adapting the staleness threshold per iteration at run time.

Predicting statistics of asynchronous SGD parameters for a large-scale distributed deep learning system on GPU supercomputers

A performance model of a distributed DCNN training system called SPRINT is proposed that uses asynchronous GPU processing based on mini-batch SGD training and can steadily choose the fastest machine configuration that nearly meets a target mini-batch size on several supercomputers with up to thousands of GPUs.