Peering Beyond the Gradient Veil with Distributed Auto Differentiation
@inproceedings{Baker2021PeeringBT,
  title={Peering Beyond the Gradient Veil with Distributed Auto Differentiation},
  author={Bradley T. Baker and Aashis Khanal and Vince D. Calhoun and Barak A. Pearlmutter and S. Plis},
  year={2021}
}
Although distributed machine learning has opened up many new and exciting research frontiers, fragmentation of models and data across different machines, nodes, and sites still results in considerable communication overhead, impeding reliable training in real-world contexts. The focus on gradients as the primary shared statistic during training has spawned a number of intuitive algorithms for distributed deep learning; however, gradient-centric training of large deep neural networks (DNNs…
References
Showing 1-10 of 40 references
Distributed Learning with Compressed Gradient Differences
- Computer Science, ArXiv
- 2019
This work proposes a new distributed learning method, DIANA, which reduces communication by compressing gradient differences, and provides a theoretical analysis in the strongly convex and nonconvex settings showing that its rates are superior to existing ones.
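For intuition, here is a minimal numpy sketch of the gradient-difference-compression idea: each worker compresses the difference between its gradient and a locally maintained reference point, and both sides update the reference from the compressed message. The random-sparsification compressor, the step size `alpha`, and all function names are illustrative choices, not the paper's exact operator or analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

def rand_sparsify(v, p=0.1):
    """Toy unbiased compressor: keep each coordinate with probability p, rescale by 1/p."""
    mask = rng.random(v.shape) < p
    return np.where(mask, v / p, 0.0)

def diana_round(grads, refs, alpha=0.1):
    """One round of the gradient-difference scheme (sketch): workers communicate
    only compressed differences between their gradients and reference points."""
    deltas = [rand_sparsify(g - h) for g, h in zip(grads, refs)]      # what is sent
    g_hat = np.mean([h + d for h, d in zip(refs, deltas)], axis=0)    # server's gradient estimate
    new_refs = [h + alpha * d for h, d in zip(refs, deltas)]          # references drift toward the gradients
    return g_hat, new_refs

# toy usage: 4 workers with 8-dimensional gradients
grads = [rng.normal(size=8) for _ in range(4)]
refs = [np.zeros(8) for _ in range(4)]
g_hat, refs = diana_round(grads, refs)
```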
Stochastic Gradient Push for Distributed Deep Learning
- Computer Science, ICML
- 2019
Stochastic Gradient Push (SGP) is studied; it is proved that SGP converges to a stationary point of smooth, non-convex objectives at the same sub-linear rate as SGD, and that all nodes achieve consensus.
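The push-sum gossip protocol underlying SGP can be illustrated in a few lines of numpy: each node repeatedly pushes equal shares of a value and a scalar weight to itself and its out-neighbors, and the de-biased ratio x/w converges to the network-wide average. This covers only the averaging component; in SGP each node would also interleave local stochastic gradient steps, and the directed ring and uniform mixing weights are assumptions made here for illustration.

```python
import numpy as np

def push_sum_round(x, w, out_neighbors):
    """One push-sum round (sketch): node i splits (x[i], w[i]) equally among
    itself and its out-neighbors; x[i] / w[i] is the de-biased average estimate."""
    n = len(x)
    new_x = [np.zeros_like(xi) for xi in x]
    new_w = [0.0] * n
    for i in range(n):
        targets = [i] + out_neighbors[i]
        share = 1.0 / len(targets)
        for j in targets:
            new_x[j] = new_x[j] + share * x[i]
            new_w[j] = new_w[j] + share * w[i]
    return new_x, new_w

# toy usage: 4 nodes on a directed ring, each starting from a different value
x = [np.array([float(i)]) for i in range(4)]
w = [1.0] * 4
out_neighbors = {0: [1], 1: [2], 2: [3], 3: [0]}
for _ in range(50):
    x, w = push_sum_round(x, w, out_neighbors)
print([(xi / wi).item() for xi, wi in zip(x, w)])   # all close to the average 1.5
```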
Sparse Binary Compression: Towards Distributed Deep Learning with minimal Communication
- Computer Science, 2019 International Joint Conference on Neural Networks (IJCNN)
- 2019
Sparse Binary Compression (SBC) combines the existing techniques of communication delay and gradient sparsification with a novel binarization method and optimal weight-update encoding, pushing compression gains to new limits and mitigating the limited communication bandwidth between contributing nodes and the otherwise prohibitive communication cost of distributed training.
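A rough numpy sketch of the accumulate-sparsify-binarize pipeline described above: gradients are accumulated into a local residual (communication delay), only the largest-magnitude fraction is kept, and the kept entries are reduced to their signs plus a single shared magnitude. The sparsity fraction `k`, the single mean magnitude, and the omission of the paper's optimal encoding step are all simplifications.

```python
import numpy as np

def sparse_binary_compress(residual, grad, k=0.01):
    """Sketch of sparsify-then-binarize with error feedback on a flat (1-D) gradient:
    accumulate the gradient, keep the top-k entries by magnitude, and send only
    their signs and one shared scale."""
    acc = residual + grad                                        # delayed / accumulated gradient
    num_keep = max(1, int(k * acc.size))
    idx = np.argpartition(np.abs(acc), -num_keep)[-num_keep:]    # indices of the largest entries
    scale = np.mean(np.abs(acc[idx]))                            # one magnitude for all kept entries
    message = (idx, np.sign(acc[idx]), scale)                    # what would be communicated
    decoded = np.zeros_like(acc)
    decoded[idx] = scale * np.sign(acc[idx])
    new_residual = acc - decoded                                 # the rest stays local for later rounds
    return message, decoded, new_residual

# toy usage on a flat 256-dimensional gradient
rng = np.random.default_rng(0)
residual = np.zeros(256)
message, decoded, residual = sparse_binary_compress(residual, rng.normal(size=256))
```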
LAG: Lazily Aggregated Gradient for Communication-Efficient Distributed Learning
- Computer Science, NeurIPS
- 2018
A new class of gradient methods for distributed machine learning is presented that adaptively skips gradient calculations to learn with reduced communication and computation, justifying the acronym LAG (Lazily Aggregated Gradient).
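At its simplest, the lazily aggregated idea is a rule for deciding whether a worker's fresh gradient differs enough from its last communicated one to be worth re-sending; otherwise the server just reuses the stale copy. The fixed threshold and plain squared-norm test below are a simplification of the adaptive condition derived in the paper.

```python
import numpy as np

def should_communicate(new_grad, last_sent_grad, threshold=1e-2):
    """LAG-style skipping rule (sketch): re-send only if the gradient has changed enough."""
    return np.sum((new_grad - last_sent_grad) ** 2) > threshold

# toy usage: a nearly unchanged gradient is not re-sent
rng = np.random.default_rng(0)
last_sent = rng.normal(size=10)
fresh = last_sent + 1e-3 * rng.normal(size=10)
if should_communicate(fresh, last_sent):
    last_sent = fresh          # communicate the fresh gradient
# else: the server keeps aggregating the stale gradient, saving one round of communication
```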
signSGD: compressed optimisation for non-convex problems
- Computer Science, ICML
- 2018
signSGD gets the best of both worlds, compressed gradients and an SGD-level convergence rate, and its momentum counterpart matches the accuracy and convergence speed of Adam on deep ImageNet models.
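A minimal sketch of sign-based compression with majority voting: each worker transmits one bit per coordinate (the sign of its gradient), and the server moves every parameter a fixed step in the direction of the elementwise majority vote. The toy gradients and the learning rate are placeholders.

```python
import numpy as np

def signsgd_vote_step(params, worker_grads, lr=1e-3):
    """signSGD with majority vote (sketch): aggregate 1-bit gradients by elementwise vote."""
    signs = np.stack([np.sign(g) for g in worker_grads])   # one bit per coordinate per worker
    vote = np.sign(signs.sum(axis=0))                      # elementwise majority vote
    return params - lr * vote

# toy usage: 5 workers with noisy copies of the same gradient
rng = np.random.default_rng(0)
params = rng.normal(size=8)
worker_grads = [params + 0.1 * rng.normal(size=8) for _ in range(5)]
params = signsgd_vote_step(params, worker_grads)
```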
Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
- Computer Science, ICLR
- 2018
This paper finds that 99.9% of the gradient exchange in distributed SGD is redundant, and proposes Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth, which enables large-scale distributed training on inexpensive commodity 1Gbps Ethernet and facilitates distributed training on mobile devices.
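The core mechanism, top-k sparsification with local error accumulation, can be sketched in a few lines of numpy; the 99.9% sparsity below mirrors the redundancy figure above, while momentum correction, warm-up, and the paper's other refinements are omitted.

```python
import numpy as np

def dgc_compress(grad, residual, sparsity=0.999):
    """Top-k sparsification with error feedback on a flat (1-D) gradient (sketch):
    send only the largest-magnitude ~0.1% of accumulated entries; keep the rest
    locally for later rounds."""
    acc = residual + grad
    k = max(1, int((1.0 - sparsity) * acc.size))
    idx = np.argpartition(np.abs(acc), -k)[-k:]     # indices of the entries actually sent
    values = acc[idx]                               # the sparse message is (idx, values)
    new_residual = acc.copy()
    new_residual[idx] = 0.0                         # communicated entries leave the residual
    return (idx, values), new_residual

# toy usage: only ~10 of 10,000 entries are communicated per round
rng = np.random.default_rng(0)
residual = np.zeros(10_000)
(idx, values), residual = dgc_compress(rng.normal(size=10_000), residual)
print(idx.size)
```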
GradiVeQ: Vector Quantization for Bandwidth-Efficient Gradient Aggregation in Distributed CNN Training
- Computer Science, NeurIPS
- 2018
This paper empirically demonstrates the strong linear correlations between CNN gradients, and proposes a gradient vector quantization technique, named GradiVeQ, to exploit these correlations through principal component analysis (PCA) for substantial gradient dimension reduction.
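A rough sketch of the PCA-based compression idea: fit principal directions on a window of recent flattened gradients, then communicate only the coefficients of new gradients in that low-dimensional basis. The window size, the component count, and the use of a one-shot batch SVD rather than the paper's streaming scheme are assumptions for illustration.

```python
import numpy as np

def fit_gradient_basis(grad_history, num_components=8):
    """Fit principal directions on a window of past flattened gradients (sketch)."""
    G = np.stack(grad_history)                       # (window, dim)
    _, _, vt = np.linalg.svd(G, full_matrices=False)
    return vt[:num_components]                       # rows span the compression subspace

def compress(grad, basis):
    return basis @ grad                              # low-dimensional coefficients (communicated)

def decompress(coeffs, basis):
    return basis.T @ coeffs                          # approximate reconstruction

# toy usage: 64-dimensional gradients compressed to 8 coefficients
rng = np.random.default_rng(0)
history = [rng.normal(size=64) for _ in range(32)]
basis = fit_gradient_basis(history)
g = rng.normal(size=64)
g_approx = decompress(compress(g, basis), basis)
```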
TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning
- Computer Science, NIPS
- 2017
This work mathematically proves the convergence of TernGrad under the assumption of a bound on gradients, and proposes layer-wise ternarizing and gradient clipping to improve its convergence.
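A minimal sketch of stochastic ternarization: each coordinate is mapped to {-s, 0, +s}, where s is the maximum absolute gradient value, with probabilities chosen so that the compressed gradient is an unbiased estimate of the original. Layer-wise application and the gradient clipping mentioned above are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def ternarize(grad):
    """Stochastic ternarization (sketch): coordinate i becomes s * sign(g_i) with
    probability |g_i| / s and 0 otherwise, where s = max |g_i|, so E[output] = grad."""
    s = np.max(np.abs(grad))
    if s == 0.0:
        return np.zeros_like(grad)
    keep = rng.random(grad.shape) < np.abs(grad) / s
    return s * np.sign(grad) * keep

# toy usage: the result only takes values in {-s, 0, +s}
g = rng.normal(size=8)
print(ternarize(g))
```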
Decentralized Deep Learning with Arbitrary Communication Compression
- Computer Science, ICLR
- 2020
The use of communication compression in the decentralized training context achieves linear speedup in the number of workers and supports higher compression than previous state-of-the-art methods.
Large Scale Distributed Deep Networks
- Computer Science, NIPS
- 2012
This paper considers the problem of training a deep network with billions of parameters using tens of thousands of CPU cores and develops two algorithms for large-scale distributed training, Downpour SGD and Sandblaster L-BFGS, which increase the scale and speed of deep network training.