• Corpus ID: 231880013

Sparse-Push: Communication- & Energy-Efficient Decentralized Distributed Learning over Directed & Time-Varying Graphs with non-IID Datasets

Sai Aparna Aketi, Amandeep Singh, Jan M. Rabaey

Current deep learning (DL) systems rely on a centralized computing paradigm, which limits the amount of available training data, increases system latency, and adds privacy and security constraints. On-device learning, enabled by decentralized and distributed training of DL models over peer-to-peer, wirelessly connected edge devices, not only alleviates these limitations but also enables next-generation applications that need DL models to continuously interact with and learn from their environment. However…

Low Precision Decentralized Distributed Training with Heterogeneous Data

The proposed low-precision decentralized training decreases computational complexity, memory usage, and communication cost by ∼4× while trading off less than 1% accuracy for both IID and non-IID data, indicating the regularization effect of the quantization.

Decentralized Learning with Separable Data: Generalization and Fast Algorithms

Improved gradient-based routines for decentralized learning with separable data are designed and empirically shown to deliver orders-of-magnitude speed-ups in both training and generalization performance.

Neighborhood Gradient Clustering: An Efficient Decentralized Learning Method for Non-IID Data Distributions

The experiments demonstrate that the proposed Neighborhood Gradient Clustering algorithm and a compressed version of it outperform the existing SoTA decentralized learning algorithm over non-IID data with significantly lower compute and memory requirements. They also show that the model-variant cross-gradient information available locally at each agent can improve performance over non-IID data by 1-35% without additional communication cost.



Decentralized Deep Learning with Arbitrary Communication Compression

The use of communication compression in the decentralized training context achieves linear speedup in the number of workers and supports higher compression than previous state-of-the-art methods.

Communication Compression for Decentralized Training

This paper develops a framework of quantized, decentralized training and proposes two strategies, extrapolation compression and difference compression, which significantly outperform the best of the merely decentralized and merely quantized algorithms on networks with high latency and low bandwidth.

Quantized Decentralized Stochastic Learning over Directed Graphs

This paper proposes a quantized decentralized stochastic learning algorithm over directed graphs, based on the push-sum algorithm from decentralized consensus optimization, and proves that it achieves the same convergence rates as the decentralized stochastic learning algorithm with exact communication for both convex and non-convex losses.

Network Topology and Communication-Computation Tradeoffs in Decentralized Optimization

This paper presents an overview of recent work in decentralized optimization and surveys the state-of-the-art algorithms and their analyses tailored to these different scenarios, highlighting the role of the network topology.

Distributed optimization over time-varying directed graphs

This work develops a broadcast-based algorithm, termed subgradient-push, which steers every node to an optimal value under a standard assumption of subgradient boundedness. It converges at a rate of O(ln t/√t), where the constant depends on the initial values at the nodes, the subgradient norms, and, more interestingly, on both the consensus speed and the imbalances of influence among the nodes.
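
The push-sum averaging step that subgradient-push builds on can be sketched as follows. This is a minimal illustration rather than the paper's algorithm: it omits the subgradient step, assumes a fixed (not time-varying) directed graph, and the function name, the `GRAPH` topology, and the initial values are all hypothetical. Each node splits its numerator and its weight equally among itself and its out-neighbors; the ratio of the two corrects for the imbalance of influence that a directed graph introduces.

```python
def push_sum_average(values, out_neighbors, num_rounds):
    """Estimate the average of `values` at every node of a directed graph.

    out_neighbors[j] lists the nodes that node j can send to (assumed input
    format). Every node also keeps a share for itself (implicit self-loop).
    """
    n = len(values)
    x = [float(v) for v in values]  # push-sum numerators
    y = [1.0] * n                   # push-sum weights
    for _ in range(num_rounds):
        new_x = [0.0] * n
        new_y = [0.0] * n
        for j in range(n):
            # Node j only needs to know its own out-degree.
            targets = set(out_neighbors[j]) | {j}
            share_x = x[j] / len(targets)
            share_y = y[j] / len(targets)
            for i in targets:
                new_x[i] += share_x
                new_y[i] += share_y
        x, y = new_x, new_y
    # The de-biased estimate at node i is the ratio x_i / y_i.
    return [xi / yi for xi, yi in zip(x, y)]


# Hypothetical strongly connected graph: ring 0->1->2->3->0 plus edge 0->2.
GRAPH = {0: [1, 2], 1: [2], 2: [3], 3: [0]}
estimates = push_sum_average([1.0, 2.0, 3.0, 6.0], GRAPH, num_rounds=500)
```

Because the mixing is column-stochastic, the sums of the numerators and of the weights are preserved, so every ratio approaches the true average of the initial values (3.0 for the values above) even though node 0 has more out-edges than the others.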

Asynchronous Decentralized Parallel Stochastic Gradient Descent

This paper proposes an asynchronous decentralized stochastic gradient descent algorithm (AD-PSGD) satisfying all above expectations; it is the first asynchronous algorithm to achieve an epoch-wise convergence rate similar to that of AllReduce-SGD at a scale of over 100 GPUs.

The Non-IID Data Quagmire of Decentralized Machine Learning

SkewScout is presented, a system-level approach that adapts the communication frequency of decentralized learning algorithms to the (skew-induced) accuracy loss between data partitions; it is also shown that group normalization can recover much of the accuracy loss of batch normalization.

D2: Decentralized Training over Decentralized Data

D², a novel decentralized parallel stochastic gradient descent algorithm designed for large data variance among workers, is presented and empirically evaluated on image classification tasks where each worker has access to only the data of a limited set of labels; it significantly outperforms D-PSGD.

QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding

Quantized SGD is proposed, a family of compression schemes for gradient updates which provides convergence guarantees and leads to significant reductions in end-to-end training time, and can be extended to stochastic variance-reduced techniques.
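
The kind of compression summarized above can be illustrated with a stochastic, unbiased quantizer. This is a minimal sketch assuming the commonly described scheme in which each coordinate is randomly rounded to one of `s` uniform levels of the vector's 2-norm. The function name, the plain-list representation, and the `rng` argument are illustrative assumptions, and the lossless encoding stage is omitted.

```python
import math
import random


def qsgd_quantize(v, s, rng):
    """Stochastically quantize vector `v` to `s` levels per coordinate.

    Each output coordinate is sign(v_i) * ||v||_2 * (level / s), where the
    integer level is rounded up or down at random so that the expected
    value of the output equals v_i.
    """
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0] * len(v)
    out = []
    for x in v:
        scaled = s * abs(x) / norm        # position in [0, s]
        lower = math.floor(scaled)
        p_up = scaled - lower             # probability of rounding up
        level = lower + (1 if rng.random() < p_up else 0)
        level = min(level, s)             # guard against float round-off
        out.append(math.copysign(norm * level / s, x))
    return out


rng = random.Random(0)                    # fixed seed for reproducibility
gradient = [0.3, -0.4, 0.0, 1.2]          # hypothetical gradient vector
quantized = qsgd_quantize(gradient, s=4, rng=rng)
```

Only the norm, the signs, and the small integer levels need to be transmitted, which is where the communication saving comes from. A larger `s` lowers the variance of the quantizer at the cost of more bits per coordinate.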

Stochastic Gradient Push for Distributed Deep Learning

Stochastic Gradient Push (SGP) is studied; it is proved that SGP converges to a stationary point of smooth, non-convex objectives at the same sub-linear rate as SGD, and that all nodes achieve consensus.