# GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent

    @article{Daily2018GossipGraDSD,
      title   = {GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent},
      author  = {Jeff A. Daily and Abhinav Vishnu and Charles Martin Siegel and Thomas E. Warfel and Vinay C. Amatya},
      journal = {ArXiv},
      year    = {2018},
      volume  = {abs/1803.05880}
    }

In this paper, we present GossipGraD, a gossip-communication-protocol-based Stochastic Gradient Descent (SGD) algorithm for scaling Deep Learning (DL) algorithms on large-scale systems. The salient features of GossipGraD are: 1) reduction in overall communication complexity from Θ(log(p)) for p compute nodes in well-studied SGD to O(1), 2) model diffusion such that compute nodes exchange their updates (gradients) indirectly after every log(p) steps, 3) rotation of communication partners…
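The partner-rotation idea can be illustrated with a toy simulation (a hypothetical hypercube-style schedule; the paper's exact rotation may differ). Each node exchanges and averages state with a single partner per step, so per-step communication is O(1) messages, while information diffuses to all p nodes within log(p) steps:

```python
import math

def hypercube_gossip(values):
    """Average p node values via gossip with rotating partners.

    In round d, node `rank` pairs with `rank XOR 2^d`: one exchange per
    node per round (O(1) communication), full diffusion in log2(p) rounds.
    """
    p = len(values)
    assert p & (p - 1) == 0, "toy version assumes p is a power of two"
    vals = list(values)
    for d in range(int(math.log2(p))):
        nxt = list(vals)
        for rank in range(p):
            partner = rank ^ (1 << d)  # rotate partners each round
            nxt[rank] = 0.5 * (vals[rank] + vals[partner])
        vals = nxt
    return vals
```

After log2(p) rounds every node holds the global average, which is the sense in which each node's updates reach all peers "indirectly after every log(p) steps" despite only O(1) messages per step.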


## 62 Citations

Accelerating Gossip-Based Deep Learning in Heterogeneous Edge Computing Platforms

- Computer Science, IEEE Transactions on Parallel and Distributed Systems
- 2021

EdgeGossip, a framework designed to accelerate decentralized, Gossip-based DL training on heterogeneous EC platforms, is implemented on top of popular Gossip algorithms and demonstrates its effectiveness on real-world DL workloads.

Communication Scheduling for Gossip SGD in a Wide Area Network

- Computer Science, IEEE Access
- 2021

A type of gossip SGD is proposed in which computation and communication overlap to accelerate learning; it is effective in both homogeneous and heterogeneous networks.

Priority-based parameter propagation for distributed deep neural network training

- Computer Science
- 2019

This work proposes a new synchronization mechanism called Priority-based Parameter Propagation (P3), which synchronizes parameters at a finer granularity and schedules data transmission in such a way that the training process incurs minimal communication delay.

Decentralized trustless gossip training of deep neural networks

- Computer Science, 2020 43rd International Convention on Information, Communication and Electronic Technology (MIPRO)
- 2020

This work proposes a novel protocol for exchanging the model knowledge between peers using a gossip algorithm combined with the stochastic gradient descent (SGD), which has the advantage of being fully asynchronous, decentralized, trustless, and independent of the network size and the churn ratio.

EventGraD: Event-Triggered Communication in Parallel Machine Learning

- Computer Science, Neurocomputing
- 2022

Efficient Asynchronous GCN Training on a GPU Cluster

- Computer Science, 2021 IEEE 27th International Conference on Parallel and Distributed Systems (ICPADS)
- 2021

This research investigates approaches for asynchronous decentralized parallel training of GCNs on a GPU cluster based on graph clustering and the Gossip protocol and demonstrates superior performance with similar accuracy scores, as compared to traditional synchronous training which uses “all reduce” to synchronously accumulate parallel training results.

Priority-based Parameter Propagation for Distributed DNN Training

- Computer Science, MLSys
- 2019

This paper proposes a new synchronization mechanism called Priority-based Parameter Propagation (P3), which synchronizes parameters at a finer granularity and schedules data transmission in such a way that the training process incurs minimal communication delay.

Addressing the Heterogeneity of A Wide Area Network for DNNs

- Computer Science, 2021 IEEE 18th Annual Consumer Communications & Networking Conference (CCNC)
- 2021

It is shown that the congestion problem can be solved by adjusting the communication frequency, that is, by training multiple times and communicating once, and a warm-up technique is proposed to improve the learning efficiency.

EventGraD: Event-Triggered Communication in Parallel Stochastic Gradient Descent

- Computer Science, 2020 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC) and Workshop on Artificial Intelligence and Machine Learning for Scientific Applications (AI4S)
- 2020

This paper focuses on data-parallel training of a popular convolutional neural network used for training the MNIST dataset and shows that EventGraD can reduce the communication load by up to 70% while retaining the same level of accuracy.
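The event-triggered idea can be sketched minimally (a hypothetical fixed-threshold rule; EventGraD's actual criterion operates per-parameter with adaptive thresholds): a node transmits a value only when it has drifted sufficiently since the last transmission, and peers otherwise reuse their stale copy.

```python
def event_triggered_send(series, threshold):
    """Transmit a value only when it drifts more than `threshold` since
    the last transmission; skipped sends are the communication savings."""
    sent = []
    last_sent = None
    for value in series:
        if last_sent is None or abs(value - last_sent) > threshold:
            sent.append(value)
            last_sent = value
    return sent
```

For slowly varying parameters most transmissions are skipped, which is where savings of the reported magnitude come from, while accuracy is retained as long as stale copies remain close to the live values.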

Demystifying Parallel and Distributed Deep Learning

- Computer Science, ACM Comput. Surv.
- 2019

The problem of parallelizing DNNs is described from a theoretical perspective, followed by approaches to parallelization and an extrapolation of potential directions for parallelism in deep learning.

## References

Showing 1–10 of 73 references

Large Scale Distributed Deep Networks

- Computer Science, NIPS
- 2012

This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.

Distributed Deep Learning Using Synchronous Stochastic Gradient Descent

- Computer Science, ArXiv
- 2016

A distributed multinode synchronous SGD algorithm is designed and implemented, without altering hyperparameters, compressing data, or altering algorithmic behavior, and the generality of this approach is demonstrated via best-in-class 6.5X scaling for a 7-layer DNN on 16 nodes.

Parle: parallelizing stochastic gradient descent

- Computer Science, ArXiv
- 2017

We propose a new algorithm called Parle for parallel training of deep networks that converges 2-4x faster than a data-parallel implementation of SGD, while achieving significantly improved error…

PowerAI DDL

- Computer Science, ArXiv
- 2017

A software-hardware co-optimized distributed Deep Learning system that can achieve near-linear scaling up to hundreds of GPUs using a multi-ring communication pattern that provides a good tradeoff between latency and bandwidth and adapts to a variety of system configurations.

GeePS: scalable deep learning on distributed GPUs with a GPU-specialized parameter server

- Computer Science, EuroSys
- 2016

GeePS enables a state-of-the-art single-node GPU implementation to scale well, such as to 13 times the number of training images processed per second on 16 machines (relative to the original optimized single-node code), and achieves a higher training throughput with just four GPU machines than a state-of-the-art CPU-only system achieves with 108 machines.

Greedy Layer-Wise Training of Deep Networks

- Computer Science, NIPS
- 2006

These experiments confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.

Large Batch Training of Convolutional Networks

- Computer Science
- 2017

It is argued that the current recipe for large batch training (linear learning rate scaling with warm-up) is not general enough and training may diverge and a new training algorithm based on Layer-wise Adaptive Rate Scaling (LARS) is proposed.
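The layer-wise rate in LARS scales each layer's step by a trust ratio of weight norm to gradient norm. A simplified sketch (omitting the weight-decay term and momentum that the full algorithm includes):

```python
import numpy as np

def lars_layer_lr(weights, grads, eta=0.001, base_lr=1.0):
    """Simplified LARS trust ratio: local_lr = eta * ||w|| / ||g||.

    Layers whose gradients are small relative to their weights get a
    proportionally larger step, stabilizing large-batch training.
    """
    w_norm = float(np.linalg.norm(weights))
    g_norm = float(np.linalg.norm(grads))
    if w_norm == 0.0 or g_norm == 0.0:
        return base_lr  # fall back when the ratio is undefined
    return base_lr * eta * w_norm / g_norm
```

Because the ratio is computed per layer, no single global learning rate has to suit every layer at once, which is the failure mode of plain linear scaling at very large batch sizes.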

FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters

- Computer Science, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- 2016

FireCaffe is presented, which successfully scales deep neural network training across a cluster of GPUs, and finds that reduction trees are more efficient and scalable than the traditional parameter server approach.

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

- Computer Science, ArXiv
- 2017

This paper empirically shows that on the ImageNet dataset large minibatches cause optimization difficulties, but that when these are addressed the trained networks exhibit good generalization, enabling training of visual recognition models on internet-scale data with high efficiency.

Scaling deep learning on GPU and knights landing clusters

- Computer Science, SC
- 2017

Four efficient algorithms are redesigned for HPC systems to address EASGD's poor scaling on clusters; they are faster than existing counterpart methods (Async SGD, Async MSGD, and Hogwild SGD) in all comparisons.