
GossipGraD: Scalable Deep Learning using Gossip Communication based Asynchronous Gradient Descent

  • Jeff A. Daily, Abhinav Vishnu, Charles Martin Siegel, Thomas E. Warfel, Vinay C. Amatya
In this paper, we present GossipGraD, a gossip-communication-based Stochastic Gradient Descent (SGD) algorithm for scaling Deep Learning (DL) on large-scale systems. The salient features of GossipGraD are: 1) reduction in overall communication complexity from Θ(log(p)) for p compute nodes in well-studied SGD to O(1), 2) model diffusion such that compute nodes exchange their updates (gradients) indirectly after every log(p) steps, 3) rotation of communication partners… 
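The partner-rotation idea can be illustrated in a few lines. The following is a hypothetical hypercube-style schedule, a sketch of the concept rather than GossipGraD's exact protocol:

```python
import math

def gossip_partner(rank: int, step: int, p: int) -> int:
    """Hypercube-style partner rotation (a sketch of the idea, not
    necessarily GossipGraD's exact schedule): each node exchanges with
    exactly one partner per step, so per-step communication is O(1),
    and the rotation cycles through log2(p) dimensions so information
    diffuses to all p nodes within log2(p) steps."""
    dims = int(math.log2(p))
    return rank ^ (1 << (step % dims))

# Simulate diffusion: after log2(p) rotation steps, every node has
# (directly or indirectly) mixed with every other node.
p = 8
known = [{r} for r in range(p)]          # info each node has seen so far
for step in range(int(math.log2(p))):
    known = [known[r] | known[gossip_partner(r, step, p)] for r in range(p)]
assert all(len(s) == p for s in known)
```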

Accelerating Gossip-Based Deep Learning in Heterogeneous Edge Computing Platforms

EdgeGossip is a framework designed to accelerate decentralized, Gossip-based DL training on heterogeneous edge computing (EC) platforms; it is implemented on top of popular Gossip algorithms, and its effectiveness is demonstrated on real-world DL workloads.

Breaking (Global) Barriers in Parallel Stochastic Optimization With Wait-Avoiding Group Averaging

This work presents Wait-Avoiding Group Model Averaging (WAGMA) SGD, a wait-avoiding stochastic optimizer that reduces global communication via subgroup weight exchange. The authors prove the convergence of WAGMA-SGD and show empirically that it retains convergence rates similar to Allreduce-SGD.
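The subgroup-exchange idea can be sketched minimally. Fixed contiguous groups are used below for simplicity; the actual method varies group membership over time so information still spreads globally:

```python
def wagma_round(models, group_size):
    """Wait-avoiding group averaging sketch: instead of one global
    allreduce across all workers, each round averages model replicas
    only within small subgroups, avoiding a global barrier. Scalar
    models and fixed groups keep this a toy illustration."""
    averaged = []
    for start in range(0, len(models), group_size):
        group = models[start:start + group_size]
        avg = sum(group) / len(group)        # one small reduction per group
        averaged.extend([avg] * len(group))
    return averaged

# Four scalar replicas, averaged pairwise rather than globally.
assert wagma_round([1.0, 3.0, 5.0, 7.0], group_size=2) == [2.0, 2.0, 6.0, 6.0]
```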

Communication Scheduling for Gossip SGD in a Wide Area Network

A type of gossip SGD in which computation and communication overlap to accelerate learning is proposed and shown to be effective in both homogeneous and heterogeneous networks.

Communication-Efficient Distributed Deep Learning: A Comprehensive Survey

A comprehensive survey of communication-efficient distributed training algorithms covering both system-level and algorithmic-level optimizations, helping readers understand which algorithms are more efficient in specific distributed environments and extrapolate potential directions for further optimization.

Decentralized trustless gossip training of deep neural networks

This work proposes a novel protocol for exchanging model knowledge between peers using a gossip algorithm combined with stochastic gradient descent (SGD), which has the advantage of being fully asynchronous, decentralized, trustless, and independent of the network size and the churn ratio.

EventGraD: Event-Triggered Communication in Parallel Machine Learning

Efficient Asynchronous GCN Training on a GPU Cluster

  • Y. Zhang, D. Goswami
  • Computer Science
    2021 IEEE 27th International Conference on Parallel and Distributed Systems (ICPADS)
  • 2021
This research investigates approaches for asynchronous decentralized parallel training of GCNs on a GPU cluster based on graph clustering and the Gossip protocol, and demonstrates superior performance with similar accuracy compared to traditional synchronous training, which uses “all reduce” to accumulate parallel training results synchronously.

Priority-based Parameter Propagation for Distributed DNN Training

This paper proposes a new synchronization mechanism called Priority-based Parameter Propagation (P3), which synchronizes parameters at a finer granularity and schedules data transmission in such a way that the training process incurs minimal communication delay.
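The scheduling idea behind P3 can be modeled with a toy priority queue. `p3_transmit_order` is a hypothetical helper, not the paper's implementation:

```python
import heapq

def p3_transmit_order(arrival_order):
    """Toy model of priority-based parameter propagation: backprop
    emits gradients back-to-front, but the next forward pass consumes
    parameters front-to-back, so transmission is ordered by layer
    index (lower index = higher priority) rather than by arrival time.
    Assumes the link drains the queue after all gradients are queued."""
    queue = []
    for layer_idx in arrival_order:
        heapq.heappush(queue, layer_idx)   # enqueue as backprop emits gradients
    # Drain the queue highest-priority (front layer) first.
    return [heapq.heappop(queue) for _ in range(len(queue))]

# Gradients arrive in reverse layer order but are sent front-first.
assert p3_transmit_order([3, 2, 1, 0]) == [0, 1, 2, 3]
```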

Addressing the Heterogeneity of A Wide Area Network for DNNs

It is shown that the congestion problem can be solved by adjusting the communication frequency, that is, by training multiple times and communicating once, and a warm-up technique is proposed to improve the learning efficiency.
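The train-multiple-times, communicate-once scheme amounts to local SGD with periodic averaging. A toy sketch, with illustrative names and constants:

```python
def local_sgd(models, grad, lr=0.1, local_steps=4, rounds=5):
    """Periodic-averaging sketch: each worker takes several SGD steps
    with no communication at all, then all replicas are averaged once
    per round, replacing per-step synchronization with a single
    exchange. Scalar models and a supplied gradient keep it minimal."""
    for _ in range(rounds):
        for _ in range(local_steps):
            models = [w - lr * grad(w) for w in models]   # local, no comm
        avg = sum(models) / len(models)                   # communicate once
        models = [avg] * len(models)
    return models

# Toy quadratic objective ||w||^2 with gradient 2w: replicas agree
# after averaging and move toward the optimum at 0.
final = local_sgd([4.0, -2.0, 1.0, 3.0], grad=lambda w: 2 * w)
assert len(set(final)) == 1 and abs(final[0]) < 0.1
```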

EventGraD: Event-Triggered Communication in Parallel Stochastic Gradient Descent

  • Soumyadip Ghosh, V. Gupta
  • Computer Science
    2020 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments (MLHPC) and Workshop on Artificial Intelligence and Machine Learning for Scientific Applications (AI4S)
  • 2020
This paper focuses on data-parallel training of a popular convolutional neural network on the MNIST dataset and shows that EventGraD can reduce the communication load by up to 70% while retaining the same level of accuracy.
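The event-triggering rule can be illustrated with a single scalar parameter. `event_triggered_updates` and the fixed threshold are illustrative; the actual method adapts its threshold per parameter:

```python
def event_triggered_updates(trace, threshold=0.5):
    """Event-triggered communication sketch: a worker broadcasts its
    parameter only when it has drifted more than `threshold` from the
    last value it sent; all other messages are skipped, and peers keep
    using the last received value."""
    last_sent = trace[0]
    sent = [trace[0]]                      # initial value is always sent
    for value in trace[1:]:
        if abs(value - last_sent) > threshold:
            sent.append(value)             # significant change: broadcast
            last_sent = value
    return sent

# 9 parameter values, but only the large jumps trigger communication,
# cutting messages from 9 to 3 in this toy trace.
trace = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1, 1.9, 2.0, 2.1]
sent = event_triggered_updates(trace)
assert sent == [0.0, 0.9, 1.9]
```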

Gossip training for deep learning

A new way to share information between different threads, inspired by gossip algorithms and showing good consensus convergence properties, is proposed, which has the advantage of being fully asynchronous and decentralized.

Asynchronous Stochastic Gradient Descent with Delay Compensation for Distributed Deep Learning

This work proposes a novel technique to compensate for the gradient delay (staleness) in asynchronous SGD, so as to make the optimization behavior of ASGD closer to that of sequential SGD; the corresponding new algorithm is called Delay-Compensated ASGD (DC-ASGD).
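The compensation can be sketched element-wise. `dc_asgd_update` and its constants are illustrative, under the assumption that the correction uses g·g as a diagonal Hessian approximation as described in the paper:

```python
def dc_asgd_update(w_current, w_stale, grad_stale, lr=0.1, lam=0.04):
    """Delay-compensated update sketch: a stale gradient computed at
    w_stale is corrected toward the fresh gradient at w_current using
    g*g as a cheap diagonal approximation of the Hessian. Operates
    element-wise on list-of-floats parameters."""
    return [
        w - lr * (g + lam * g * g * (w - w_old))   # compensated SGD step
        for w, w_old, g in zip(w_current, w_stale, grad_stale)
    ]

# With lam=0 this reduces to plain ASGD applying the stale gradient.
assert dc_asgd_update([1.0], [0.5], [2.0], lam=0.0) == [1.0 - 0.1 * 2.0]
```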

Large Scale Distributed Deep Networks

This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.

Parle: parallelizing stochastic gradient descent

We propose a new algorithm called Parle for parallel training of deep networks that converges 2-4x faster than a data-parallel implementation of SGD, while achieving a significantly improved error rate.

SparkNet: Training Deep Networks in Spark

This work introduces SparkNet, a framework for training deep networks in Spark using a simple parallelization scheme for stochastic gradient descent that scales well with the cluster size and tolerates very high-latency communication.

Scaling SGD Batch Size to 32K for ImageNet Training

Layer-wise Adaptive Rate Scaling (LARS) is proposed, a method to enable large-batch training to general networks or datasets, and it can scale the batch size to 32768 for ResNet50 and 8192 for AlexNet.
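The LARS ratio can be sketched per layer. `lars_lr` and its constants are illustrative; only the weight-norm-to-gradient-norm ratio follows the paper:

```python
def lars_lr(weights, grads, base_lr=0.1, trust=0.001, wd=0.0005):
    """Layer-wise Adaptive Rate Scaling sketch: scale each layer's
    learning rate by the ratio of its weight norm to its gradient
    norm, so no layer takes a step that is large relative to its
    current weights. Flat lists of floats stand in for one layer's
    parameters; constants are illustrative, not the paper's."""
    norm_w = sum(w * w for w in weights) ** 0.5
    norm_g = sum(g * g for g in grads) ** 0.5
    local = trust * norm_w / (norm_g + wd * norm_w)   # the LARS ratio
    return base_lr * local

# A layer with large weights but small gradients gets a larger step
# than one with small weights and large gradients.
big_w = lars_lr([3.0, 4.0], [0.1, 0.0])
small_w = lars_lr([0.3, 0.4], [1.0, 0.0])
assert big_w > small_w
```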


A software-hardware co-optimized distributed Deep Learning system that can achieve near-linear scaling up to hundreds of GPUs using a multi-ring communication pattern that provides a good tradeoff between latency and bandwidth and adapts to a variety of system configurations.

GeePS: scalable deep learning on distributed GPUs with a GPU-specialized parameter server

GeePS enables a state-of-the-art single-node GPU implementation to scale well, such as to 13 times the number of training images processed per second on 16 machines (relative to the original optimized single-node code), and achieves higher training throughput with just four GPU machines than a state-of-the-art CPU-only system achieves with 108 machines.

An introduction to computational networks and the computational network toolkit (invited talk)

The Computational Network Toolkit (CNTK), an implementation of computational networks (CNs) that supports both GPU and CPU, is introduced; the architecture and key components of CNTK, the command-line options for using CNTK, and the network definition and model editing language are described.

Greedy Layer-Wise Training of Deep Networks

These experiments confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.