A Convergence Analysis of Distributed SGD with Communication-Efficient Gradient Sparsification

@inproceedings{Shi2019ACA,
  title={A Convergence Analysis of Distributed SGD with Communication-Efficient Gradient Sparsification},
  author={Shaohuai Shi and Kaiyong Zhao and Qiang Wang and Zhenheng Tang and Xiaowen Chu},
  booktitle={IJCAI},
  year={2019}
}
Gradient sparsification is a promising technique to significantly reduce the communication overhead in decentralized synchronous stochastic gradient descent (S-SGD) algorithms. Yet, many existing gradient sparsification schemes (e.g., Top-k sparsification) have a communication complexity of O(kP), where k is the number of gradient elements selected by each worker and P is the number of workers. Recently, the gTop-k sparsification scheme has been proposed to reduce the communication complexity from O(kP) …
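
To make the O(kP) cost concrete, the following is a minimal NumPy sketch (not the paper's implementation) of the worker-side Top-k step: each of the P workers keeps only its k largest-magnitude gradient entries and ships them as (index, value) pairs, so naively gathering all messages moves on the order of kP values. The gTop-k scheme analyzed in the paper reduces this cost further and is not reproduced here.

import numpy as np

def topk_sparsify(grad: np.ndarray, k: int):
    """Keep the k largest-magnitude entries of a flat gradient vector.
    Returns (indices, values); everything else is treated as zero."""
    flat = grad.ravel()
    # argpartition avoids a full sort; the k selected indices come back unsorted.
    idx = np.argpartition(np.abs(flat), -k)[-k:]
    return idx, flat[idx]

def dense_from_sparse(idx, vals, size):
    out = np.zeros(size, dtype=vals.dtype)
    out[idx] = vals
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    P, n, k = 4, 10_000, 100          # workers, gradient size, entries kept per worker
    grads = [rng.standard_normal(n) for _ in range(P)]

    # Each worker sends k (index, value) pairs; a naive allgather therefore
    # moves k * P values in total -- the O(kP) cost mentioned in the abstract.
    messages = [topk_sparsify(g, k) for g in grads]
    aggregated = sum(dense_from_sparse(i, v, n) for i, v in messages) / P
    print("non-zeros after aggregation:", np.count_nonzero(aggregated))
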

Citations

Understanding Top-k Sparsification in Distributed Deep Learning
TLDR
The property of the gradient distribution is exploited to propose an approximate top-k selection algorithm that is compute-efficient on GPUs, improving the scaling efficiency of TopK-SGD by significantly reducing the computation overhead.
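
The selection idea summarized above relies on the empirical gradient distribution. The sketch below is one simple, hypothetical way to exploit a roughly Gaussian distribution to pick a magnitude threshold without a full sort; it illustrates the general idea only, not the authors' GPU algorithm.

import numpy as np
from statistics import NormalDist

def approx_topk_gaussian(grad: np.ndarray, k: int):
    """Threshold-based approximate top-k selection.
    Assumes the gradient entries are roughly zero-mean Gaussian, so a
    magnitude threshold t with P(|g| > t) = k/n selects about k entries
    without sorting all n values.  Illustrative only."""
    flat = grad.ravel()
    n = flat.size
    sigma = flat.std()
    # Gaussian tail inversion: pick t so that roughly k entries exceed it in magnitude.
    t = sigma * NormalDist().inv_cdf(1.0 - k / (2.0 * n))
    idx = np.flatnonzero(np.abs(flat) > t)
    return idx, flat[idx]

if __name__ == "__main__":
    g = np.random.default_rng(1).standard_normal(100_000)
    idx, vals = approx_topk_gaussian(g, k=1000)
    print("requested k=1000, selected:", idx.size)  # close to 1000 when g is near-Gaussian
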
A Distributed SGD Algorithm with Global Sketching for Deep Learning Training Acceleration
  • LingFei Dai, Boyu Diao, Chao Li, Yongjun Xu
  • Computer Science
  • ArXiv
  • 2021
TLDR
This work proposes a gradient compression method with global gradient vector sketching, named global-sketching SGD (gs-SGD), which uses the Count-Sketch structure to store the gradients and reduce accuracy loss during training.
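
As context for the summary above, the snippet below is a generic, minimal Count-Sketch in NumPy: gradient coordinates are hashed into a small table with random signs, and queries take a median across rows to suppress collision noise. The class, table sizes, and hashing scheme are illustrative assumptions, not the gs-SGD implementation.

import numpy as np

class CountSketch:
    """Minimal Count-Sketch for a length-n gradient vector (illustrative only)."""

    _PRIME = 2_147_483_647  # 2**31 - 1, used for simple universal hashing

    def __init__(self, n: int, rows: int = 5, width: int = 2048, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.n, self.rows, self.width = n, rows, width
        # Random hash parameters per row: a bucket hash and a {-1, +1} sign hash.
        self.a = rng.integers(1, self._PRIME, size=rows, dtype=np.int64)
        self.b = rng.integers(0, self._PRIME, size=rows, dtype=np.int64)
        self.c = rng.integers(1, self._PRIME, size=rows, dtype=np.int64)
        self.d = rng.integers(0, self._PRIME, size=rows, dtype=np.int64)
        self.table = np.zeros((rows, width))

    def _buckets(self, idx):
        return (self.a[:, None] * idx + self.b[:, None]) % self._PRIME % self.width

    def _signs(self, idx):
        return 2 * ((self.c[:, None] * idx + self.d[:, None]) % self._PRIME % 2) - 1

    def add(self, grad: np.ndarray):
        idx = np.arange(self.n, dtype=np.int64)
        buckets, signs = self._buckets(idx), self._signs(idx)
        for r in range(self.rows):
            np.add.at(self.table[r], buckets[r], signs[r] * grad)

    def query(self) -> np.ndarray:
        idx = np.arange(self.n, dtype=np.int64)
        buckets, signs = self._buckets(idx), self._signs(idx)
        estimates = np.stack([signs[r] * self.table[r, buckets[r]] for r in range(self.rows)])
        return np.median(estimates, axis=0)  # median across rows limits collision noise

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    g = np.zeros(50_000)
    heavy = rng.choice(g.size, size=20, replace=False)
    g[heavy] = rng.normal(0, 10, size=20)          # a few large coordinates
    sk = CountSketch(g.size, rows=5, width=4096)
    sk.add(g)
    print("max abs error on heavy coords:", np.max(np.abs(sk.query()[heavy] - g[heavy])))
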
Error-Compensated Sparsification for Communication-Efficient Decentralized Training in Edge Environment
TLDR
This work designs a method named ECSD-SGD that significantly accelerates decentralized training via error-compensated sparsification and outperforms state-of-the-art sparsification methods in terms of both convergence speed and final generalization accuracy.
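
The error-compensation idea mentioned above follows the standard error-feedback pattern: entries dropped by sparsification are kept locally and added back to the next gradient before selecting again. The sketch below shows that generic pattern only; it does not reproduce ECSD-SGD's decentralized protocol.

import numpy as np

def topk(vec: np.ndarray, k: int):
    idx = np.argpartition(np.abs(vec), -k)[-k:]
    sparse = np.zeros_like(vec)
    sparse[idx] = vec[idx]
    return sparse

def error_compensated_step(grad: np.ndarray, residual: np.ndarray, k: int):
    """One worker-side step of the generic error-feedback pattern:
    add the leftover from previous rounds, send only the top-k part,
    and keep what was dropped as the new residual."""
    corrected = grad + residual
    to_send = topk(corrected, k)
    new_residual = corrected - to_send
    return to_send, new_residual

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    n, k = 10_000, 100
    residual = np.zeros(n)
    for step in range(5):
        grad = rng.standard_normal(n)
        msg, residual = error_compensated_step(grad, residual, k)
        print(f"step {step}: sent {np.count_nonzero(msg)} values, "
              f"residual norm {np.linalg.norm(residual):.1f}")
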
Communication-Efficient Distributed Deep Learning with Merged Gradient Sparsification on GPUs
TLDR
The trade-off between communication and computation (including backward computation and gradient sparsification) is formulated as an optimization problem, and an optimal solution to the problem is derived.
Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees
TLDR
A new distributed optimization method named LAGS-SGD is proposed, which combines S-SGD with a novel layer-wise adaptive gradient sparsification (LAGS) scheme; it has convergence guarantees and the same order of convergence rate as vanilla S-SGD under a weak analytical assumption.
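
A rough NumPy sketch of the layer-wise part of this idea follows: each layer's gradient is sparsified independently as soon as backpropagation produces it, so its communication can overlap with the computation of earlier layers. The fixed per-layer density used here is a placeholder, not LAGS-SGD's adaptive choice.

import numpy as np

def topk_mask(vec, k):
    idx = np.argpartition(np.abs(vec), -k)[-k:]
    out = np.zeros_like(vec)
    out[idx] = vec[idx]
    return out

def layerwise_sparsify(layer_grads, density=0.01):
    """Sparsify each layer's gradient independently, so communication of
    layer L can overlap with back-propagation of layer L-1.
    `density` stands in for the per-layer compression ratio; the adaptive
    choice of this ratio in LAGS-SGD is not reproduced here."""
    sparsified = []
    for g in layer_grads:                      # layers arrive back-to-front in backprop
        k = max(1, int(density * g.size))
        sparsified.append(topk_mask(g.ravel(), k).reshape(g.shape))
    return sparsified

if __name__ == "__main__":
    rng = np.random.default_rng(7)
    grads = [rng.standard_normal(s) for s in [(512, 256), (256, 128), (128, 10)]]
    for g, s in zip(grads, layerwise_sparsify(grads)):
        print(g.shape, "->", np.count_nonzero(s), "non-zeros")
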
O(1) Communication for Distributed SGD through Two-Level Gradient Averaging
TLDR
A2SGD is the first to achieve O(1) communication complexity per worker for distributed SGD; the evaluation validates the theoretical conclusion and demonstrates that A2SGD significantly reduces per-worker communication traffic and improves the overall training time of LSTM-PTB.
Adaptive Gradient Sparsification for Efficient Federated Learning: An Online Learning Approach
TLDR
This paper presents a fairness-aware GS method that ensures different clients provide a similar number of updates, and proposes a novel online learning formulation and algorithm for automatically determining the near-optimal communication and computation trade-off, which is controlled by the degree of gradient sparsity.
Exploiting Simultaneous Communications to Accelerate Data Parallel Distributed Deep Learning
  • S. Shi, X. Chu, Bo Li
  • Computer Science
  • IEEE INFOCOM 2021 - IEEE Conference on Computer Communications
  • 2021
TLDR
This paper formulates an optimization problem of minimizing the training iteration time, in which both tensor fusion and simultaneous communications are allowed, develops an efficient optimal scheduling solution, and implements the distributed training algorithm ASC-WFBP.
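
Tensor fusion, one ingredient of the summary above, can be illustrated with a short greedy bucketing routine: small per-layer gradients are concatenated into larger flat buffers so each collective call amortizes its startup latency. The bucket size and greedy policy below are assumptions for illustration; ASC-WFBP's optimal scheduling and simultaneous communications are not modeled.

import numpy as np

def fuse_into_buckets(layer_grads, bucket_bytes=4 * 1024 * 1024):
    """Greedily merge per-layer gradients (back-to-front) into flat buckets of
    roughly `bucket_bytes` each, so one collective call per bucket amortizes
    the per-message startup latency."""
    buckets, current, current_bytes = [], [], 0
    for g in layer_grads:
        current.append(g.ravel())
        current_bytes += g.nbytes
        if current_bytes >= bucket_bytes:
            buckets.append(np.concatenate(current))
            current, current_bytes = [], 0
    if current:
        buckets.append(np.concatenate(current))
    return buckets

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shapes = [(1000, 1000), (512, 512), (256, 256), (64, 64), (10,)]
    grads = [rng.standard_normal(s).astype(np.float32) for s in shapes]
    buckets = fuse_into_buckets(grads, bucket_bytes=2 * 1024 * 1024)
    print([b.nbytes for b in buckets])   # a few large messages instead of one per layer
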
Communication-Efficient Distributed Deep Learning: A Comprehensive Survey
TLDR
A comprehensive survey of communication-efficient distributed training algorithms covering both system-level and algorithmic-level optimizations, helping readers understand which algorithms are more efficient under specific distributed environments and extrapolate potential directions for further optimization.
Communication Efficient Sparsification for Large Scale Machine Learning
TLDR
The theoretical results and experiments indicate that the automatic tuning strategies significantly increase communication efficiency on several state-of-the-art compression schemes.

References

Showing 1-10 of 20 references
ATOMO: Communication-efficient Learning via Atomic Sparsification
TLDR
ATOMO is presented, a general framework for atomic sparsification of stochastic gradients; methods such as QSGD and TernGrad are shown to be special cases of ATOMO, and sparsifying gradients in their singular value decomposition (SVD) can lead to significantly faster distributed training.
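
To illustrate the atomic-sparsification idea in the summary above, the sketch below samples SVD atoms of a gradient matrix with probability p_i and rescales kept atoms by 1/p_i, which keeps the estimate unbiased. The simple proportional choice of p_i here is an assumption for brevity; ATOMO derives variance-optimal probabilities under a sparsity budget, which is not done here.

import numpy as np

def svd_atomic_sparsify(grad_matrix: np.ndarray, budget: int, rng):
    """Unbiased sparsification over SVD atoms: each rank-1 component is kept
    with probability p_i and rescaled by 1/p_i, so E[estimate] = grad_matrix."""
    u, s, vt = np.linalg.svd(grad_matrix, full_matrices=False)
    p = np.minimum(1.0, budget * s / s.sum())      # simple, non-optimal probabilities
    keep = rng.random(s.size) < p
    # Rescale the kept singular values to preserve unbiasedness.
    s_hat = np.where(keep, s / np.maximum(p, 1e-12), 0.0)
    return (u * s_hat) @ vt, int(keep.sum())

if __name__ == "__main__":
    rng = np.random.default_rng(5)
    g = rng.standard_normal((64, 32))
    est, kept = svd_atomic_sparsify(g, budget=8, rng=rng)
    print("atoms kept:", kept, "of", min(g.shape))
    # Averaging many independent estimates should approach g (unbiasedness).
    avg = np.mean([svd_atomic_sparsify(g, 8, rng)[0] for _ in range(2000)], axis=0)
    print("relative error of the average:", np.linalg.norm(avg - g) / np.linalg.norm(g))
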
Gradient Sparsification for Communication-Efficient Distributed Optimization
TLDR
This paper proposes a convex optimization formulation to minimize the coding length of stochastic gradients; experiments on regularized logistic regression, support vector machines, and convolutional neural networks validate the proposed approaches.
Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
TLDR
This paper finds that 99.9% of the gradient exchange in distributed SGD is redundant and proposes Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth, which enables large-scale distributed training on inexpensive commodity 1 Gbps Ethernet and facilitates distributed training on mobile devices.
MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms
  • S. Shi, X. Chu
  • Computer Science
  • IEEE INFOCOM 2019 - IEEE Conference on Computer Communications
  • 2019
TLDR
This paper develops an optimal solution named merged-gradient wait-free backpropagation (MG-WFBP), implements it in the open-source deep learning platform B-Caffe, and shows that the MG-WFBP algorithm can achieve much better scaling efficiency than the existing methods WFBP and SyncEASGD.
Round-Robin Synchronization: Mitigating Communication Bottlenecks in Parameter Servers
  • C. Chen, W. Wang, B. Li
  • Computer Science
  • IEEE INFOCOM 2019 - IEEE Conference on Computer Communications
  • 2019
TLDR
This paper proposes the Round-Robin Synchronous Parallel (R2SP) scheme, which coordinates workers to make updates in an evenly-gapped, round-robin manner, and extends R2SP to heterogeneous clusters by adaptively tuning the batch size of each worker based on its processing capability.
AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training
TLDR
This paper introduces a novel technique, the Adaptive Residual Gradient Compression (AdaComp) scheme, which is based on localized selection of gradient residues and automatically tunes the compression rate depending on local activity.
Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes
TLDR
This work builds a highly scalable deep learning training system for dense GPU clusters with three main contributions: a mixed-precision training method that significantly improves the training throughput of a single GPU without losing accuracy, an optimization approach for extremely large mini-batch sizes that can train CNN models on the ImageNet dataset without accuracy loss, and highly optimized all-reduce algorithms.
SparCML: high-performance sparse communication for machine learning
TLDR
The generic communication library SparCML extends MPI to support additional features, such as non-blocking (asynchronous) operations and low-precision data representations, and will form the basis of future highly scalable machine learning frameworks.
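
The core operation such a library must support is reducing sparse (index, value) messages across workers. The sketch below merges per-worker contributions pairwise in plain NumPy to show how the union of indices can grow toward k*P; it is a toy stand-in, not SparCML's MPI implementation.

import numpy as np

def merge_sparse(a, b):
    """Sum two sparse gradients given as (indices, values) pairs.
    The union of indices can grow toward k*P after merging all workers,
    which is the effect sparse allreduce implementations must manage."""
    idx = np.concatenate([a[0], b[0]])
    vals = np.concatenate([a[1], b[1]])
    uniq, inverse = np.unique(idx, return_inverse=True)
    summed = np.zeros(uniq.size)
    np.add.at(summed, inverse, vals)
    return uniq, summed

def sparse_allreduce(contributions):
    """Pairwise (tree-style) reduction of per-worker sparse messages."""
    msgs = list(contributions)
    while len(msgs) > 1:
        msgs = [merge_sparse(msgs[i], msgs[i + 1]) if i + 1 < len(msgs) else msgs[i]
                for i in range(0, len(msgs), 2)]
    return msgs[0]

if __name__ == "__main__":
    rng = np.random.default_rng(9)
    n, k, P = 100_000, 500, 8
    contributions = []
    for _ in range(P):
        idx = rng.choice(n, size=k, replace=False)
        contributions.append((idx, rng.standard_normal(k)))
    idx, vals = sparse_allreduce(contributions)
    print("non-zeros after reduction:", idx.size, "(at most k*P =", k * P, ")")
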
Performance Modeling and Evaluation of Distributed Deep Learning Frameworks on GPUs
  • S. Shi, Xiaowen Chu
  • Computer Science
  • 2018 IEEE 16th Intl Conf on Dependable, Autonomic and Secure Computing, 16th Intl Conf on Pervasive Intelligence and Computing, 4th Intl Conf on Big Data Intelligence and Computing and Cyber Science and Technology Congress (DASC/PiCom/DataCom/CyberSciTech)
  • 2018
TLDR
This study evaluates the running performance of four state-of-the-art distributed deep learning frameworks (i.e., Caffe-MPI, CNTK, MXNet, and TensorFlow) over single-GPU, multi-GPU, and multi-node environments and identifies bottlenecks and overheads that could be further optimized.
Deep Residual Learning for Image Recognition
TLDR
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.