• Corpus ID: 4782856

GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent

  • Jeff A. Daily, Abhinav Vishnu, Charles Martin Siegel, Thomas E. Warfel, Vinay C. Amatya
In this paper, we present GossipGraD, a Stochastic Gradient Descent (SGD) algorithm based on a gossip communication protocol for scaling Deep Learning (DL) algorithms on large-scale systems. The salient features of GossipGraD are: 1) reduction in overall communication complexity from Θ(log(p)) for p compute nodes in well-studied SGD to O(1), 2) model diffusion such that compute nodes exchange their updates (gradients) indirectly after every log(p) steps, 3) rotation of communication partners… 
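The O(1)-per-step communication and partner rotation described above can be sketched as follows. This is a minimal illustration, not GossipGraD's actual protocol: the XOR-based partner schedule, the `exchange` callback, and the power-of-two node count are all assumptions made here for a self-contained example.

```python
import numpy as np

def gossip_partner(rank: int, step: int, p: int) -> int:
    # Symmetric pairing via XOR with a rotating nonzero mask; assumes p is a
    # power of two so every mask in 1..p-1 maps ranks onto valid partners, and
    # partners rotate so each node meets every other node within p - 1 steps.
    mask = 1 + step % (p - 1)
    return rank ^ mask

def gossip_sgd_step(params, grads, rank, step, p, lr=0.01, exchange=None):
    # One gossip-SGD step: apply the local SGD update, then average the model
    # with a single rotating partner, so each node sends O(1) messages per
    # step instead of participating in a log(p)-depth collective.
    params = params - lr * grads
    if exchange is not None:
        partner = gossip_partner(rank, step, p)
        partner_params = exchange(partner, params)  # send/recv with partner
        params = 0.5 * (params + partner_params)
    return params
```

In a real deployment `exchange` would be a point-to-point send/receive (e.g. over MPI); here it is a placeholder so the step function can be exercised in isolation.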
Accelerating Gossip-Based Deep Learning in Heterogeneous Edge Computing Platforms
EdgeGossip, a framework specifically designed to accelerate decentralized, Gossip-based DL training on heterogeneous edge computing (EC) platforms, is implemented on top of popular Gossip algorithms and demonstrates its effectiveness on real-world DL workloads.
Communication Scheduling for Gossip SGD in a Wide Area Network
A variant of gossip SGD is proposed in which computation and communication overlap to accelerate learning; it is effective in both homogeneous and heterogeneous networks.
Priority-based parameter propagation for distributed deep neural network training
This work proposes a new synchronization mechanism called Priority-based Parameter Propagation (P3), which synchronizes parameters at a finer granularity and schedules data transmission in such a way that the training process incurs minimal communication delay.
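The finer-granularity scheduling idea behind P3 can be sketched as a priority ordering over gradient slices. This is a hypothetical illustration of the general technique, not the paper's implementation: `slice_size` and the tuple-based priority are assumptions chosen to keep the example small.

```python
import heapq

def p3_schedule(layer_grads, slice_size=4):
    # Hypothetical sketch of priority-based parameter propagation: split each
    # layer's gradient into fixed-size slices and transmit slices of earlier
    # layers first, because the next forward pass consumes earlier layers
    # first; slice_size controls the (illustrative) synchronization granularity.
    heap = []
    for layer_idx, grad in enumerate(layer_grads):
        for start in range(0, len(grad), slice_size):
            # Priority key: (layer index, offset) — unique, so the slice
            # payload itself is never compared by the heap.
            heapq.heappush(heap, (layer_idx, start, grad[start:start + slice_size]))
    return [heapq.heappop(heap) for _ in range(len(heap))]
```

Interleaving slices this way lets urgent (early-layer) data preempt bulk transfers of later layers, which is the source of the reduced communication delay the abstract describes.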
Decentralized trustless gossip training of deep neural networks
This work proposes a novel protocol for exchanging the model knowledge between peers using a gossip algorithm combined with the stochastic gradient descent (SGD), which has the advantage of being fully asynchronous, decentralized, trustless, and independent of the network size and the churn ratio.
EventGraD: Event-Triggered Communication in Parallel Machine Learning
Efficient Asynchronous GCN Training on a GPU Cluster
  • Y. Zhang, D. Goswami
  • Computer Science
    2021 IEEE 27th International Conference on Parallel and Distributed Systems (ICPADS)
  • 2021
This research investigates approaches for asynchronous decentralized parallel training of GCNs on a GPU cluster based on graph clustering and the Gossip protocol, and demonstrates superior performance at similar accuracy compared to traditional synchronous training, which uses "all-reduce" to synchronously accumulate parallel training results.
Addressing the Heterogeneity of A Wide Area Network for DNNs
It is shown that the congestion problem can be solved by adjusting the communication frequency, that is, by training multiple times and communicating once, and a warm-up technique is proposed to improve the learning efficiency.
EventGraD: Event-Triggered Communication in Parallel Stochastic Gradient Descent
  • Soumyadip Ghosh, V. Gupta
  • Computer Science
    2020 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC) and Workshop on Artificial Intelligence and Machine Learning for Scientific Applications (AI4S)
  • 2020
This paper focuses on data-parallel training of a popular convolutional neural network used for training the MNIST dataset and shows that EventGraD can reduce the communication load by up to 70% while retaining the same level of accuracy.
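The event-triggered mechanism summarized above can be sketched as a drift threshold on outgoing parameters. This is a minimal illustration of the general idea, not EventGraD's implementation: the class name, the L2-norm criterion, and the fixed threshold are assumptions made for the example.

```python
import numpy as np

class EventTriggeredSender:
    # A node broadcasts its parameters only when they have drifted more than
    # `threshold` (in L2 norm) from the last value it sent, instead of
    # communicating on every training step.
    def __init__(self, params: np.ndarray, threshold: float):
        self.last_sent = params.copy()
        self.threshold = threshold
        self.sends = 0

    def maybe_send(self, params: np.ndarray) -> bool:
        # Return True (and record a send) only when the event triggers.
        if np.linalg.norm(params - self.last_sent) > self.threshold:
            self.last_sent = params.copy()
            self.sends += 1
            return True
        return False
```

When parameters change slowly between steps, most steps trigger no send, which is where communication savings of the kind reported above come from.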
Demystifying Parallel and Distributed Deep Learning
The problem of parallelizing DNN training is described from a theoretical perspective, followed by concrete approaches to parallelization, and potential directions for parallelism in deep learning are extrapolated.
Large Scale Distributed Deep Networks
This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.
Distributed Deep Learning Using Synchronous Stochastic Gradient Descent
A distributed multinode synchronous SGD algorithm is designed and implemented, without altering hyperparameters, compressing data, or changing algorithmic behavior, and the generality of this approach is demonstrated via best-in-class 6.5X scaling for a 7-layer DNN on 16 nodes.
Parle: parallelizing stochastic gradient descent
We propose a new algorithm called Parle for parallel training of deep networks that converges 2-4x faster than a data-parallel implementation of SGD, while achieving a significantly improved error rate.
A software-hardware co-optimized distributed Deep Learning system that can achieve near-linear scaling up to hundreds of GPUs using a multi-ring communication pattern that provides a good tradeoff between latency and bandwidth and adapts to a variety of system configurations.
GeePS: scalable deep learning on distributed GPUs with a GPU-specialized parameter server
GeePS enables a state-of-the-art single-node GPU implementation to scale well, such as to 13 times the number of training images processed per second on 16 machines (relative to the original optimized single-node code), and achieves a higher training throughput with just four GPU machines than a state-of-the-art CPU-only system achieves with 108 machines.
Greedy Layer-Wise Training of Deep Networks
These experiments confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.
Large Batch Training of Convolutional Networks
It is argued that the current recipe for large batch training (linear learning rate scaling with warm-up) is not general enough and training may diverge; a new training algorithm based on Layer-wise Adaptive Rate Scaling (LARS) is proposed.
FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters
FireCaffe is presented, which successfully scales deep neural network training across a cluster of GPUs, and finds that reduction trees are more efficient and scalable than the traditional parameter server approach.
Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour
This paper empirically shows that on the ImageNet dataset large minibatches cause optimization difficulties, but when these are addressed the trained networks exhibit good generalization, enabling training of visual recognition models on internet-scale data with high efficiency.
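The recipe this line of work popularized, linear learning-rate scaling with gradual warm-up, can be sketched as a schedule function. The constants below (base rate, batch sizes, warm-up length) are illustrative assumptions, not the paper's exact hyperparameters.

```python
def warmup_linear_scaled_lr(step, base_lr=0.1, base_batch=256,
                            batch=8192, warmup_steps=500):
    # Linear-scaling rule: the target LR grows with the batch size,
    # target = base_lr * (batch / base_batch); to avoid early divergence,
    # ramp linearly from base_lr up to the target over `warmup_steps`.
    target = base_lr * batch / base_batch
    if step < warmup_steps:
        return base_lr + (target - base_lr) * step / warmup_steps
    return target
```

After warm-up the schedule simply holds the scaled rate, which in practice would then be combined with whatever decay schedule the training run uses.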
Scaling deep learning on GPU and knights landing clusters
Four efficient algorithms are redesigned for HPC systems to improve EASGD's poor scaling on clusters; they are faster than existing counterpart methods (Async SGD, Async MSGD, and Hogwild SGD) in all comparisons.