# A Convergence Analysis of Distributed SGD with Communication-Efficient Gradient Sparsification

@inproceedings{Shi2019ACA,
title={A Convergence Analysis of Distributed SGD with Communication-Efficient Gradient Sparsification},
author={Shaohuai Shi and Kaiyong Zhao and Qiang Wang and Zhenheng Tang and Xiaowen Chu},
booktitle={International Joint Conference on Artificial Intelligence},
year={2019}
}
Published in the International Joint Conference on Artificial Intelligence, 1 August 2019.
## Abstract

Gradient sparsification is a promising technique to significantly reduce the communication overhead in decentralized synchronous stochastic gradient descent (S-SGD) algorithms. Yet, many existing gradient sparsification schemes (e.g., Top-k sparsification) have a communication complexity of O(kP), where k is the number of selected gradients by each worker and P is the number of workers. Recently, the gTop-k sparsification scheme has been proposed to reduce the communication complexity from O(kP…
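As a concrete illustration of the Top-k scheme the abstract refers to, here is a minimal, self-contained sketch of per-worker Top-k selection; all names are illustrative, not from the paper:

```python
import heapq

def topk_sparsify(grad, k):
    """Keep the k largest-magnitude entries of a flat gradient vector.

    Returns the selected indices and values (the sparse payload each
    worker would communicate) plus the residual of zeroed-out entries,
    which error-compensation schemes accumulate locally.
    """
    # Indices of the k largest-magnitude components.
    idx = heapq.nlargest(k, range(len(grad)), key=lambda i: abs(grad[i]))
    values = [grad[i] for i in idx]
    residual = list(grad)
    for i in idx:
        residual[i] = 0.0  # these entries are dropped, not sent
    return idx, values, residual

# Each of P workers sends k (index, value) pairs, so gathering all
# workers' selections costs O(kP) traffic, matching the abstract's bound.
grad = [0.1, -2.0, 0.05, 1.5, -0.3, 0.7]
idx, vals, res = topk_sparsify(grad, k=2)  # selects indices 1 and 3
```

With k fixed, each worker's payload size is constant, but the aggregated sparse vectors from P workers generally have disjoint supports, which is why naive aggregation scales as O(kP).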

## Citations

- *ArXiv*, 2019: Exploits the property of the gradient distribution to propose an approximate top-k selection algorithm that is computationally efficient on GPUs, improving the scaling efficiency of TopK-SGD by significantly reducing the computation overhead.
- *IEEE INFOCOM 2020 - IEEE Conference on Computer Communications*, 2020: Formulates the trade-off between communication and computation (including backward computation and gradient sparsification) as an optimization problem and derives an optimal solution.
- *ECAI*, 2020: Proposes LAGS-SGD, a distributed optimization method that combines S-SGD with a novel layer-wise adaptive gradient sparsification (LAGS) scheme; it has convergence guarantees and the same order of convergence rate as vanilla S-SGD under a weak analytical assumption.
- *2021 IEEE International Conference on Cluster Computing (CLUSTER)*, 2021: A2SGD is the first scheme to achieve $\mathcal{O}(1)$ communication complexity per worker, communicating only two scalars representing gradients per worker for distributed SGD, without incurring significant accuracy degradation of DNN models.
- *IEEE Transactions on Parallel and Distributed Systems*, 2022: Proposes MIPD, an adaptive, layer-wise gradient sparsification framework that compresses gradients based on model interpretability and the probability distribution of the gradients, ensuring high accuracy compared to state-of-the-art solutions.
- *2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS)*, 2020: Presents a fairness-aware gradient sparsification method that ensures different clients provide a similar amount of updates, and proposes a novel online learning formulation and algorithm for automatically determining the near-optimal communication/computation trade-off controlled by the degree of gradient sparsity.
- *IEEE INFOCOM 2021 - IEEE Conference on Computer Communications*, 2021: Formulates an optimization problem of minimizing the training iteration time, in which both tensor fusion and simultaneous communications are allowed, develops an efficient optimal scheduling solution, and implements the distributed training algorithm ASC-WFBP.
- *ArXiv*, 2022: Conducts experiments that show the inefficiency of Top-k SGD, provides insight into the causes of its low performance, and plans a high-performance gradient sparsification method as future work.
- *IEEE Transactions on Parallel and Distributed Systems*, 2023: Designs a novel sparsification algorithm so that each client only needs to communicate a highly sparsified model with one peer, and proposes a novel gossip-matrix generation algorithm that better utilizes bandwidth resources while preserving the convergence property.
- *ArXiv*, 2022: Develops a new analysis of error feedback (EF) under partial client participation, an important scenario in federated learning, and proves that under partial participation the convergence rate of Fed-EF exhibits an extra slow-down factor due to a so-called "stale error compensation" effect.
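The error-feedback (EF) mechanism referenced in the last entry compensates for compression error by carrying forward whatever the compressor dropped. A minimal single-worker sketch, with an intentionally crude keep-one compressor and illustrative names (not from any of the papers above):

```python
def compress(grad):
    """Toy compressor: keep only the single largest-magnitude entry."""
    j = max(range(len(grad)), key=lambda i: abs(grad[i]))
    out = [0.0] * len(grad)
    out[j] = grad[j]
    return out

def ef_step(grad, error):
    """One error-feedback step: compress (gradient + carried error),
    then carry forward the part that was dropped."""
    corrected = [g + e for g, e in zip(grad, error)]
    sent = compress(corrected)
    new_error = [c - s for c, s in zip(corrected, sent)]
    return sent, new_error

# Over two steps, the dropped coordinate is compensated rather than lost.
error = [0.0, 0.0]
sent1, error = ef_step([1.0, 0.5], error)   # sends [1.0, 0.0], carries [0.0, 0.5]
sent2, error = ef_step([0.25, 0.5], error)  # corrected [0.25, 1.0] -> sends [0.0, 1.0]
```

The "stale error compensation" slow-down analyzed in the Fed-EF work arises when a client's carried `error` waits several rounds before it participates again, so the compensation it eventually applies is based on outdated gradients.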

## References

Showing 1-10 of 20 references.

- *NeurIPS*, 2018: Presents ATOMO, a general framework for atomic sparsification of stochastic gradients; shows that methods such as QSGD and TernGrad are special cases of ATOMO, and that sparsifying gradients in their singular value decomposition (SVD) can lead to significantly faster distributed training.
- *NeurIPS*, 2018: Proposes a convex optimization formulation to minimize the coding length of stochastic gradients; experiments on regularized logistic regression, support vector machines, and convolutional neural networks validate the proposed approaches.
- *ICLR*, 2018: Finds that 99.9% of the gradient exchange in distributed SGD is redundant and proposes Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth, enabling large-scale distributed training on inexpensive commodity 1 Gbps Ethernet and facilitating distributed training on mobile devices.
- *IEEE INFOCOM 2019 - IEEE Conference on Computer Communications*, 2019: Develops an optimal solution named merged-gradient wait-free backpropagation (MG-WFBP), implements it in the open-source deep learning platform B-Caffe, and shows that MG-WFBP achieves much better scaling efficiency than the existing methods WFBP and SyncEASGD.
- Chen Chen, *IEEE INFOCOM 2019 - IEEE Conference on Computer Communications*, 2019: Proposes the Round-Robin Synchronous Parallel (R2SP) scheme, which coordinates workers to make updates in an evenly-gapped, round-robin manner, and extends R2SP to heterogeneous clusters by adaptively tuning each worker's batch size based on its processing capability.
- Introduces the Adaptive Residual Gradient Compression (AdaComp) scheme, based on localized selection of gradient residues, which automatically tunes the compression rate depending on local activity.
- *ArXiv*, 2018: Builds a highly scalable deep learning training system for dense GPU clusters with three main contributions: a mixed-precision training method that significantly improves single-GPU training throughput without losing accuracy, an optimization approach for extremely large mini-batch sizes that can train CNN models on the ImageNet dataset without loss of accuracy, and highly optimized all-reduce algorithms.
- *SC*, 2019: Presents SparCML, a generic communication library that extends MPI with additional features such as non-blocking (asynchronous) operations and low-precision data representations, intended to form the basis of future highly scalable machine learning frameworks.
- *2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech)*, 2018: Evaluates the running performance of four state-of-the-art distributed deep learning frameworks (Caffe-MPI, CNTK, MXNet, and TensorFlow) over single-GPU, multi-GPU, and multi-node environments, and identifies bottlenecks and overheads that could be further optimized.
- Kaiming He, Jian Sun, *2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)*, 2016: Presents a residual learning framework to ease the training of networks substantially deeper than those used previously, and provides comprehensive empirical evidence that these residual networks are easier to optimize and can gain accuracy from considerably increased depth.