MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms
@article{Shi2019MGWFBPED,
  title   = {MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms},
  author  = {Shaohuai Shi and Xiaowen Chu},
  journal = {IEEE INFOCOM 2019 - IEEE Conference on Computer Communications},
  year    = {2019},
  pages   = {172-180}
}
Distributed synchronous stochastic gradient descent has been widely used to train deep neural networks on computer clusters. As computational power grows, network communication has become a limiting factor of system scalability. In this paper, we observe that many deep neural networks consist of a large number of layers, each with only a small amount of data to be communicated. Based on the fact that merging some short communication tasks into a single one may reduce the overall…
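The core idea in the abstract, reducing per-message startup overhead by merging many small layer-wise gradient transfers into fewer larger ones, can be illustrated with a short sketch. The code below is a minimal illustration and not the authors' MG-WFBP implementation (the paper also decides which layers to merge and overlaps communication with backpropagation); it assumes mpi4py for the all-reduce, and the layer count and sizes are made up.

```python
# Sketch only: contrasts per-layer all-reduce with a single merged all-reduce.
# Assumes mpi4py and NumPy; layer sizes below are illustrative, not from the paper.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD


def allreduce_per_layer(grads):
    """One all-reduce per layer tensor: pays the network startup cost per layer."""
    out = []
    for g in grads:
        buf = np.empty_like(g)
        comm.Allreduce(g, buf, op=MPI.SUM)
        out.append(buf / comm.Get_size())
    return out


def allreduce_merged(grads):
    """Merged communication: concatenate small tensors, do a single all-reduce,
    then split the averaged result back into the original per-layer shapes."""
    flat = np.concatenate([g.ravel() for g in grads])
    buf = np.empty_like(flat)
    comm.Allreduce(flat, buf, op=MPI.SUM)
    buf /= comm.Get_size()
    out, offset = [], 0
    for g in grads:
        out.append(buf[offset:offset + g.size].reshape(g.shape))
        offset += g.size
    return out


if __name__ == "__main__":
    # Many layers carrying little data each, mirroring the observation in the abstract.
    grads = [np.random.rand(64).astype(np.float32) for _ in range(100)]
    merged = allreduce_merged(grads)       # one communication call
    per_layer = allreduce_per_layer(grads)  # 100 communication calls
```

Run with, for example, `mpiexec -n 4 python merge_sketch.py` (the script name is arbitrary). The merged version issues a single all-reduce instead of one per layer, so the per-message startup latency is paid once rather than once per layer.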
Supplemental Code
GitHub Repo: MG-WFBP: Efficient Data Communication for Distributed Synchronous SGD Algorithms
26 Citations
- MG-WFBP: Merging Gradients Wisely for Efficient Communication in Distributed Deep Learning. IEEE Transactions on Parallel and Distributed Systems, 2021.
- A Distributed Synchronous SGD Algorithm with Global Top-k Sparsification for Low Bandwidth Networks. 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), 2019.
- Communication-Efficient Distributed Deep Learning with Merged Gradient Sparsification on GPUs. IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, 2020.
- Communication-Efficient Distributed Deep Learning: A Comprehensive Survey. arXiv, 2020.
- Communication optimization strategies for distributed deep neural network training: A survey. J. Parallel Distributed Comput., 2021.
- A Quantitative Survey of Communication Optimizations in Distributed Deep Learning. 2020.
- Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees. ECAI, 2020.
- Communication Optimization Strategies for Distributed Deep Learning: A Survey. arXiv, 2020.
- Preemptive All-reduce Scheduling for Expediting Distributed DNN Training. IEEE INFOCOM 2020 - IEEE Conference on Computer Communications, 2020. (Highly Influenced)
- Towards Scalable Distributed Training of Deep Learning on Public Cloud Clusters. arXiv, 2020.
References
Showing 1-10 of 31 references
- A DAG Model of Synchronous Stochastic Gradient Descent in Distributed Deep Learning. 2018 IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS), 2018.
- A Distributed Synchronous SGD Algorithm with Global Top-k Sparsification for Low Bandwidth Networks. 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS), 2019.
- Poseidon: An Efficient Communication Architecture for Distributed Deep Learning on GPU Clusters. USENIX Annual Technical Conference, 2017. (Highly Influential)
- TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning. NIPS, 2017.
- GeePS: scalable deep learning on distributed GPUs with a GPU-specialized parameter server. EuroSys, 2016.
- Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training. ICLR, 2018.
- Round-Robin Synchronization: Mitigating Communication Bottlenecks in Parameter Servers. IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, 2019.
- Highly Scalable Deep Learning Training System with Mixed-Precision: Training ImageNet in Four Minutes. arXiv, 2018.