A Convergence Analysis of Distributed SGD with Communication-Efficient Gradient Sparsification

  • Shaohuai Shi, Kaiyong Zhao, Qiang Wang, Zhenheng Tang, Xiaowen Chu
  • International Joint Conference on Artificial Intelligence
Gradient sparsification is a promising technique to significantly reduce the communication overhead in distributed synchronous stochastic gradient descent (S-SGD) algorithms. Yet, many existing gradient sparsification schemes (e.g., Top-k sparsification) have a communication complexity of O(kP), where k is the number of gradient components selected by each worker and P is the number of workers. Recently, the gTop-k sparsification scheme has been proposed to reduce the communication complexity from O(kP…
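The Top-k scheme mentioned in the abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the function names are illustrative. Each worker keeps only the k largest-magnitude gradient entries and communicates (index, value) pairs, so the per-worker message is O(k) instead of O(n):

```python
import numpy as np

def topk_sparsify(grad, k):
    """Keep only the k largest-magnitude entries of a gradient vector.

    Returns (indices, values); all other entries are treated as zero,
    so each worker communicates O(k) numbers instead of the full gradient.
    """
    flat = grad.ravel()
    # argpartition finds the k largest |g_i| without a full sort
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]

def densify(idx, vals, n):
    """Rebuild a dense vector from the sparse (indices, values) message."""
    out = np.zeros(n)
    out[idx] = vals
    return out

grad = np.array([0.1, -3.0, 0.02, 2.5, -0.4])
idx, vals = topk_sparsify(grad, 2)
dense = densify(idx, vals, grad.size)
```

When every one of the P workers sends its own k entries to all peers, the aggregated communication grows as O(kP), which is the cost the gTop-k scheme in this paper is designed to reduce.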


Understanding Top-k Sparsification in Distributed Deep Learning

The property of the gradient distribution is exploited to propose an approximate top-k selection algorithm, which is compute-efficient on GPUs, to improve the scaling efficiency of TopK-SGD by significantly reducing the computing overhead.
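The idea of approximate top-k selection can be illustrated with a simple threshold-based sketch. This is a hedged example, not necessarily the authors' exact method: here the threshold is estimated from a random subsample's quantile, after which selection is a single vectorized comparison rather than a full sort, which is the kind of operation that maps well to GPUs:

```python
import numpy as np

def approx_topk(grad, k, sample_frac=0.01, seed=0):
    """Approximate Top-k via a threshold estimated on a random subsample.

    Instead of fully sorting n elements, estimate the k-th largest
    magnitude from a small sample and select by comparison. The number
    of selected elements is only approximately k. Assumes a flattened
    1-D gradient vector.
    """
    rng = np.random.default_rng(seed)
    n = grad.size
    m = max(int(n * sample_frac), 64)  # sample at least 64 elements
    sample = rng.choice(np.abs(grad), size=min(m, n), replace=False)
    # threshold = estimated (1 - k/n)-quantile of |g|
    thr = np.quantile(sample, 1.0 - k / n)
    idx = np.flatnonzero(np.abs(grad) >= thr)
    return idx, grad[idx]

grad = np.random.default_rng(1).standard_normal(10_000)
idx, vals = approx_topk(grad, 100)
```

Because selection is pure thresholding, every kept entry has magnitude at least as large as every dropped one; only the count of kept entries fluctuates around k.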

Communication-Efficient Distributed Deep Learning with Merged Gradient Sparsification on GPUs

  • S. Shi, Qiang Wang, Xin Zhao
  • Computer Science
    IEEE INFOCOM 2020 - IEEE Conference on Computer Communications
  • 2020
The trade-off between communications and computations (including backward computation and gradient sparsification) is formulated as an optimization problem, and an optimal solution to the problem is derived.

Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees

A new distributed optimization method named LAGS-SGD is proposed, which combines S-SGD with a novel layer-wise adaptive gradient sparsification (LAGS) scheme; it has convergence guarantees and the same order of convergence rate as vanilla S-SGD under a weak analytical assumption.

O(1) Communication for Distributed SGD through Two-Level Gradient Averaging

A2SGD is the first to achieve O(1) communication complexity per worker without incurring significant accuracy degradation of DNN models, communicating only two scalars representing gradients per worker for distributed SGD.

MIPD: An Adaptive Gradient Sparsification Framework for Distributed DNNs Training

MIPD is proposed, an adaptive and layer-wise gradient sparsification framework that compresses gradients based on model interpretability and the probability distribution of gradients, ensuring high accuracy compared to state-of-the-art solutions.

Adaptive Gradient Sparsification for Efficient Federated Learning: An Online Learning Approach

This paper presents a fairness-aware GS method which ensures that different clients provide a similar amount of updates, and proposes a novel online learning formulation and algorithm for automatically determining the near-optimal communication and computation trade-off that is controlled by the degree of gradient sparsity.

Exploiting Simultaneous Communications to Accelerate Data Parallel Distributed Deep Learning

  • S. Shi, X. Chu, Bo Li
  • Computer Science
    IEEE INFOCOM 2021 - IEEE Conference on Computer Communications
  • 2021
This paper formulates an optimization problem of minimizing the training iteration time, in which both tensor fusion and simultaneous communications are allowed, develops an efficient optimal scheduling solution, and implements the distributed training algorithm ASC-WFBP.

Empirical Analysis on Top-k Gradient Sparsification for Distributed Deep Learning in a Supercomputing Environment

Experiments are conducted that show the inefficiency of Top-k SGD and provide insight into its low performance; a high-performance gradient sparsification method is planned as future work.

GossipFL: A Decentralized Federated Learning Framework With Sparsified and Adaptive Communication

This work designs a novel sparsification algorithm so that each client needs to communicate with only one peer using a highly sparsified model, and proposes a new gossip matrix generation algorithm that better utilizes bandwidth resources while preserving the convergence property.

Analysis of Error Feedback in Federated Non-Convex Optimization with Biased Compression

This work develops a new analysis of error feedback (EF) under partial client participation, an important scenario in FL, and proves that under partial participation the convergence rate of Fed-EF exhibits an extra slowdown factor due to a so-called "stale error compensation" effect.
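The error-feedback mechanism analyzed above can be sketched in a few lines. This is a generic illustration of EF with a biased Top-k compressor, not the paper's Fed-EF algorithm: each client carries the residual dropped by the compressor and adds it back before the next compression, so no gradient mass is permanently lost:

```python
import numpy as np

def topk(g, k):
    """Biased Top-k compressor: zero out all but the k largest-magnitude entries."""
    out = np.zeros_like(g)
    idx = np.argpartition(np.abs(g), -k)[-k:]
    out[idx] = g[idx]
    return out

def ef_step(grad, error, k):
    """One error-feedback step: compensate, compress, carry the residual."""
    corrected = grad + error            # add back what was dropped before
    compressed = topk(corrected, k)     # transmit only this sparse update
    new_error = corrected - compressed  # residual re-enters the next step
    return compressed, new_error

e = np.zeros(2)
c1, e = ef_step(np.array([1.0, 0.2]), e, k=1)   # drops the 0.2 entry
c2, e = ef_step(np.array([0.1, 0.1]), e, k=1)   # compensated 0.3 now wins
```

The "stale error compensation" effect in the paper arises when a client skips rounds under partial participation: its residual sits unused and is applied against a later, older-looking model state.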



ATOMO: Communication-efficient Learning via Atomic Sparsification

ATOMO is presented, a general framework for atomic sparsification of stochastic gradients; it is shown that methods such as QSGD and TernGrad are special cases of ATOMO, and that sparsifying gradients in their singular value decomposition (SVD) can lead to significantly faster distributed training.

Gradient Sparsification for Communication-Efficient Distributed Optimization

This paper proposes a convex optimization formulation to minimize the coding length of stochastic gradients; experiments on regularized logistic regression, support vector machines, and convolutional neural networks validate the proposed approaches.

Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training

This paper finds that 99.9% of the gradient exchange in distributed SGD is redundant, and proposes Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth, enabling large-scale distributed training on inexpensive commodity 1 Gbps Ethernet and facilitating distributed training on mobile devices.

MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms

  • S. Shi, X. Chu
  • Computer Science
    IEEE INFOCOM 2019 - IEEE Conference on Computer Communications
  • 2019
This paper develops an optimal solution named merged-gradient wait-free backpropagation (MG-WFBP), implements it in the open-source deep learning platform B-Caffe, and shows that the MG-WFBP algorithm achieves much better scaling efficiency than the existing methods WFBP and SyncEASGD.

Round-Robin Synchronization: Mitigating Communication Bottlenecks in Parameter Servers

  • Chen Chen, Wei Wang, Bo Li
  • Computer Science
    IEEE INFOCOM 2019 - IEEE Conference on Computer Communications
  • 2019
This paper proposes the Round-Robin Synchronous Parallel (R2SP) scheme, which coordinates workers to make updates in an evenly-gapped, round-robin manner, and extends R2SP to heterogeneous clusters by adaptively tuning the batch size of each worker based on its processing capability.

AdaComp : Adaptive Residual Gradient Compression for Data-Parallel Distributed Training

This paper introduces a novel technique, the Adaptive Residual Gradient Compression (AdaComp) scheme, which is based on localized selection of gradient residues and automatically tunes the compression rate depending on local activity.

Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes

This work builds a highly scalable deep learning training system for dense GPU clusters with three main contributions: a mixed-precision training method that significantly improves the training throughput of a single GPU without losing accuracy, an optimization approach for extremely large mini-batch sizes that can train CNN models on the ImageNet dataset without loss of accuracy, and highly optimized all-reduce algorithms.

SparCML: high-performance sparse communication for machine learning

The generic communication library, SparCML, extends MPI to support additional features, such as non-blocking (asynchronous) operations and low-precision data representations, and will form the basis of future highly scalable machine learning frameworks.

Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs

  • S. Shi, Xiaowen Chu
  • Computer Science
    2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress(DASC/PiCom/DataCom/CyberSciTech)
  • 2018
This study evaluates the running performance of four state-of-the-art distributed deep learning frameworks (i.e., Caffe-MPI, CNTK, MXNet, and TensorFlow) over single-GPU, multi-GPU, and multi-node environments and identifies bottlenecks and overheads which could be further optimized.

Deep Residual Learning for Image Recognition

This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.