Sparse-Push: Communication- & Energy-Efficient Decentralized Distributed Learning over Directed & Time-Varying Graphs with non-IID Datasets
@article{Aketi2021SparsePushC,
  title={Sparse-Push: Communication- \& Energy-Efficient Decentralized Distributed Learning over Directed \& Time-Varying Graphs with non-IID Datasets},
  author={Sai Aparna Aketi and Amandeep Singh and Jan M. Rabaey},
  journal={ArXiv},
  year={2021},
  volume={abs/2102.05715}
}
Current deep learning (DL) systems rely on a centralized computing paradigm, which limits the amount of available training data, increases system latency, and adds privacy & security constraints. On-device learning, enabled by decentralized and distributed training of DL models over peer-to-peer wirelessly connected edge devices, not only alleviates the above limitations but also enables next-gen applications that need DL models to continuously interact with and learn from their environment. However…
4 Citations
Low Precision Decentralized Distributed Training with Heterogeneous Data
- Computer Science · ArXiv
- 2021
The proposed low precision decentralized training decreases computational complexity, memory usage, and communication cost by ∼4× while trading off less than 1% accuracy for both IID and non-IID data, indicating the regularization effect of the quantization.
Low precision decentralized distributed training over IID and non-IID data
- Computer Science · Neural Networks
- 2022
Decentralized Learning with Separable Data: Generalization and Fast Algorithms
- Computer Science · ArXiv
- 2022
Improved gradient-based routines for decentralized learning with separable data are designed, and orders-of-magnitude speed-ups in both training and generalization performance are demonstrated empirically.
Neighborhood Gradient Clustering: An Efficient Decentralized Learning Method for Non-IID Data Distributions
- Computer Science · ArXiv
- 2022
The experiments demonstrate that the proposed Neighborhood Gradient Clustering algorithm and a compressed version of it outperform the existing SoTA decentralized learning algorithm over non-IID data with significantly lower compute and memory requirements, and show that the model-variant cross-gradient information available locally at each agent can improve the performance over non-IID data by 1-35% without additional communication cost.
References
SHOWING 1-10 OF 16 REFERENCES
Decentralized Deep Learning with Arbitrary Communication Compression
- Computer Science · ICLR
- 2020
The use of communication compression in the decentralized training context achieves linear speedup in the number of workers and supports higher compression than previous state-of-the-art methods.
Communication Compression for Decentralized Training
- Computer Science · NeurIPS
- 2018
This paper develops a framework for quantized, decentralized training and proposes two strategies, extrapolation compression and difference compression, which significantly outperform the better of the merely decentralized and merely quantized algorithms on networks with high latency and low bandwidth.
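To make the difference-compression idea concrete, here is a minimal NumPy sketch of my own (not the paper's algorithm): each node transmits a compressed difference against a shared replica of its last-communicated model, and receivers apply the same decoded update so the replica stays synchronized; `quantize` is a placeholder for any compressor.

```python
import numpy as np

def quantize(delta, levels=16):
    """Placeholder compressor: coarse uniform quantization of a difference."""
    scale = np.max(np.abs(delta)) + 1e-12
    return np.round(delta / scale * levels) / levels * scale

def difference_compress_round(x_local, x_replica):
    """One communication round of difference compression (illustrative only).

    The sender transmits quantize(x_local - x_replica); every receiver keeps a
    replica of the sender's model and adds the decoded difference, so sender
    and receivers stay synchronized on the compressed replica.
    """
    msg = quantize(x_local - x_replica)   # the only payload put on the wire
    x_replica = x_replica + msg           # identical update on both ends
    return msg, x_replica
```

Because only the compressed difference travels over the network, the bandwidth cost drops while both ends agree exactly on the replica used for subsequent rounds.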
Quantized Decentralized Stochastic Learning over Directed Graphs
- Computer Science · ICML
- 2020
This paper proposes a quantized decentralized stochastic learning algorithm over directed graphs, based on the push-sum protocol from decentralized consensus optimization, and proves that it achieves the same convergence rates as its exact-communication counterpart for both convex and non-convex losses.
Network Topology and Communication-Computation Tradeoffs in Decentralized Optimization
- Computer Science · Proceedings of the IEEE
- 2018
This paper presents an overview of recent work in decentralized optimization and surveys the state-of-the-art algorithms and their analyses tailored to these different scenarios, highlighting the role of the network topology.
Distributed optimization over time-varying directed graphs
- Computer Science, Mathematics · 52nd IEEE Conference on Decision and Control
- 2013
This work develops a broadcast-based algorithm, termed subgradient-push, which steers every node to an optimal value under a standard assumption of subgradient boundedness; the method converges at a rate of O(ln t/√t), where the constant depends on the initial values at the nodes, the subgradient norms, and, more interestingly, on both the consensus speed and the imbalances of influence among the nodes.
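For concreteness, the per-node subgradient-push update can be written as below; this is a sketch following the standard statement of the method, where d_j(t) is the out-degree of node j at time t, N_i^in(t) its in-neighborhood (including i itself), α(t) a diminishing step size, and g_i a subgradient of node i's local objective f_i.

```latex
\begin{aligned}
w_i(t+1) &= \sum_{j \in N_i^{\mathrm{in}}(t)} \frac{x_j(t)}{d_j(t)},
\qquad
y_i(t+1) = \sum_{j \in N_i^{\mathrm{in}}(t)} \frac{y_j(t)}{d_j(t)}, \\
z_i(t+1) &= \frac{w_i(t+1)}{y_i(t+1)},
\qquad
x_i(t+1) = w_i(t+1) - \alpha(t+1)\, g_i(t+1),
\end{aligned}
```

with y_i(0) = 1 and g_i(t+1) evaluated at z_i(t+1); the de-biased iterates z_i are the ones that converge at the O(ln t/√t) rate quoted above.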
Asynchronous Decentralized Parallel Stochastic Gradient Descent
- Computer Science · ICML
- 2018
This paper proposes an asynchronous decentralized parallel stochastic gradient descent algorithm (AD-PSGD) that satisfies all of the above expectations and is the first asynchronous algorithm to achieve an epoch-wise convergence rate similar to AllReduce-SGD at a scale of over 100 GPUs.
The Non-IID Data Quagmire of Decentralized Machine Learning
- Computer Science · ICML
- 2020
SkewScout is presented, a system-level approach that adapts the communication frequency of decentralized learning algorithms to the (skew-induced) accuracy loss between data partitions, and it is shown that group normalization can recover much of the accuracy loss of batch normalization.
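As a minimal illustration of the normalization swap mentioned above (my own PyTorch sketch, not from the paper; the channel and group counts are arbitrary):

```python
import torch.nn as nn

# BatchNorm depends on per-batch statistics, which become skewed when each
# worker's local data is non-IID; GroupNorm normalizes within each sample.
bn = nn.BatchNorm2d(num_features=64)               # batch-dependent statistics
gn = nn.GroupNorm(num_groups=8, num_channels=64)   # batch-independent swap-in
```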
D2: Decentralized Training over Decentralized Data
- Computer Science · ICML
- 2018
D², a novel decentralized parallel stochastic gradient descent algorithm designed for large data variance among workers, is presented, empirically evaluated on image classification tasks where each worker has access only to the data of a limited set of labels, and shown to significantly outperform D-PSGD.
QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding
- Computer Science · NIPS
- 2017
Quantized SGD (QSGD), a family of compression schemes for gradient updates, is proposed; it provides convergence guarantees, leads to significant reductions in end-to-end training time, and can be extended to stochastic variance-reduced techniques.
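The core quantizer can be sketched in a few lines; the snippet below is an illustrative NumPy implementation of s-level stochastic quantization in the spirit of QSGD, not the authors' code, and the function name is my own.

```python
import numpy as np

def qsgd_quantize(v, s=256, rng=None):
    """Illustrative s-level stochastic quantizer in the spirit of QSGD.

    Each coordinate of the gradient v is encoded as sign(v_i) times one of
    s+1 levels of |v_i| / ||v||_2, with stochastic rounding so that the
    quantized vector is an unbiased estimate of v.
    """
    rng = np.random.default_rng() if rng is None else rng
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    scaled = s * np.abs(v) / norm        # each entry lies in [0, s]
    lower = np.floor(scaled)
    prob_up = scaled - lower             # probability of rounding up one level
    levels = lower + (rng.random(v.shape) < prob_up)
    return norm * np.sign(v) * levels / s
```

Transmitting only the norm, the signs, and the integer levels is what yields the communication savings, and the stochastic rounding keeps the quantizer unbiased, which the convergence guarantees rely on.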
Stochastic Gradient Push for Distributed Deep Learning
- Computer Science · ICML
- 2019
Stochastic Gradient Push (SGP) is studied; it is proved that SGP converges to a stationary point of smooth, non-convex objectives at the same sub-linear rate as SGD and that all nodes achieve consensus.
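A minimal sketch of one synchronous SGP iteration, assuming a column-stochastic mixing matrix and NumPy arrays (my own illustration, not the authors' implementation):

```python
import numpy as np

def sgp_step(x, w, P, grads, lr):
    """One synchronous Stochastic Gradient Push iteration (illustrative sketch).

    x:     (n_nodes, dim) push-sum numerators, one row per node
    w:     (n_nodes,)     push-sum scalar weights, initialized to ones
    P:     (n_nodes, n_nodes) column-stochastic mixing matrix of the digraph;
           P[i, j] is the share that node j pushes to node i
    grads: (n_nodes, dim) stochastic gradients evaluated at z_i = x_i / w_i
    lr:    step size
    """
    x = x - lr * grads        # local stochastic gradient step on the numerator
    x = P @ x                 # push-sum mixing of the numerators
    w = P @ w                 # push-sum mixing of the scalar weights
    z = x / w[:, None]        # de-biased parameters each node actually uses
    return x, w, z
```

With P only column-stochastic (as is natural on a directed graph), the scalar weights w correct the bias that the non-doubly-stochastic mixing would otherwise introduce into the averaged parameters.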