• Corpus ID: 244948602

# Asynchronous Decentralized SGD with Quantized and Local Updates

@inproceedings{Nadiradze2019AsynchronousDS,
title={Asynchronous Decentralized SGD with Quantized and Local Updates},
author={Giorgi Nadiradze and Amirmojtaba Sabour and Peter Davies and Shigang Li and Dan Alistarh},
booktitle={Neural Information Processing Systems},
year={2019}
}
• Published in Neural Information Processing Systems
• 27 October 2019
• Computer Science
Decentralized optimization is emerging as a viable alternative for scalable distributed machine learning, but it also introduces new challenges in terms of synchronization costs. To this end, several communication-reduction techniques, such as non-blocking communication, quantization, and local steps, have been explored in the decentralized setting. Due to the complexity of analyzing optimization in such a relaxed setting, this line of work often assumes global communication rounds, which require…
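The recipe named in the title (nodes take local SGD steps, quantize their models, and average with a peer) can be sketched on a toy problem. This is a hedged illustration, not the paper's algorithm: the uniform stochastic quantizer, step counts, learning rate, and two-node averaging below are all assumptions.

```python
# Toy sketch: two nodes with different local objectives alternate a few
# local SGD steps with an average of quantized models.
import random

random.seed(0)

def grad(x, target):
    """Gradient of the local loss 0.5 * (x - target)^2."""
    return x - target

def quantize(v, levels=64, scale=4.0):
    """Unbiased stochastic rounding onto a fixed uniform grid (assumed scheme)."""
    step = 2 * scale / levels
    low = step * int(v // step)
    p = (v - low) / step  # probability of rounding up
    return low + step if random.random() < p else low

def local_sgd(x, target, steps=4, lr=0.1):
    """Run a few local gradient steps before communicating."""
    for _ in range(steps):
        x -= lr * grad(x, target)
    return x

targets = [1.0, 3.0]  # each node has a different local optimum
xs = [0.0, 0.0]
for _ in range(50):
    xs = [local_sgd(x, t) for x, t in zip(xs, targets)]
    q = [quantize(x) for x in xs]     # only quantized models are exchanged
    avg = sum(q) / 2                  # pairwise gossip average
    xs = [avg, avg]

# Both nodes end up near the average optimum (2.0), up to quantization error.
```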
## Figures and Tables from this paper

## 9 Citations

• Computer Science
ArXiv
• 2022
This paper proposes an algorithm incorporating decentralized distributed memory computing architecture with each node running multiprocessing parallel shared-memory SGD itself, and proves that the method guarantees ergodic convergence rates for non-convex objectives.
• Computer Science
ICML
• 2021
It is shown in theory that when the training consensus distance is lower than a critical quantity, decentralized training converges as fast as the centralized counterpart, and empirical insights allow the principled design of better decentralized training schemes that mitigate the performance drop.
• Computer Science
• 2022
A suite of techniques to optimize system efficiency holistically are proposed, including a highly general parameter-spilling design that enables large models to be trained even with a single GPU, a novel multi-query optimization scheme that blends model execution schedules efficiently and maximizes GPU utilization, and a double buffering idea to hide latency.
• Computer Science
ICC 2022 - IEEE International Conference on Communications
• 2022
This work proposes an asynchronous decentralized stochastic gradient descent algorithm, robust to the inherent computation and communication failures occurring at the wireless network edge, and theoretically analyze its performance and establishes a non-asymptotic convergence guarantee.
• Computer Science
ArXiv
• 2021
This paper provides recommendations and guidelines on formulating, designing, evaluating and analyzing federated optimization algorithms through concrete examples and practical implementation, with a focus on conducting effective simulations to infer real-world performance.
• Computer Science
ArXiv
• 2022
This work jointly addresses two of the main practical challenges when scaling federated optimization to large node counts: the need for tight synchronization between the central authority and individual computing nodes, and the large communication cost of transmissions between the central server and clients.
• Computer Science
ArXiv
• 2022
Theoretically, it is proved that SWIFT matches the gold-standard iteration convergence rate O(1/√T) of parallel stochastic gradient descent for convex and non-convex smooth optimization (total iterations T), with results for both IID and non-IID settings, without the bounded-delay assumption for slow clients that other asynchronous decentralized FL algorithms require.
• Computer Science
ArXiv
• 2022
This work essentially shows that, under reasonable parameter settings, a hybrid decentralized optimization system can not only withstand noisier zeroth-order agents, but can even benefit from integrating such agents into the optimization process, rather than ignoring their information.
• Computer Science
ICML
• 2022
It is proved that the consensus model learned by D-SGD is O(m/N + 1/m + λ²)-stable in expectation in the non-convex non-smooth setting, which is non-vacuous even when λ is close to 1, in contrast to the vacuous bounds suggested by existing literature.
• Computer Science
ArXiv
• 2022
This work introduces a novel recursion based on “virtual iterates” and delay-adaptive stepsizes, which allows it to derive state-of-the-art guarantees for both convex and non-convex objectives.
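The delay-adaptive stepsize idea credited above can be sketched in a few lines: gradients computed on stale copies of the model get a proportionally smaller step. The 1/(1 + delay) schedule and the staleness pattern below are illustrative assumptions, not the paper's recursion.

```python
# Toy sketch: asynchronous SGD on f(x) = x^2, where each update uses a
# gradient evaluated at a stale iterate, with a delay-adaptive stepsize.

def grad(x):
    return 2.0 * x  # gradient of f(x) = x^2, minimized at x = 0

def delay_adaptive_lr(base_lr, delay):
    """Shrink the step for stale gradients (assumed 1/(1+delay) schedule)."""
    return base_lr / (1.0 + delay)

history = [5.0]  # model iterates, starting from x_0 = 5
for t in range(40):
    delay = t % 3  # simulated staleness pattern: 0, 1, 2, 0, ...
    stale_x = history[max(0, len(history) - 1 - delay)]  # iterate the worker read
    lr = delay_adaptive_lr(0.2, delay)
    history.append(history[-1] - lr * grad(stale_x))

# Despite the stale gradients, the iterates still converge to the minimum.
```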

## References

Showing 1-10 of 50 references

• Computer Science
ICML
• 2020
This paper introduces a unified convergence analysis that covers a large variety of decentralized SGD methods which so far have required different intuitions, have different applications, and which have been developed separately in various communities.
Concise convergence rates are proved for local SGD on convex problems, showing that it converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients; that is, the scheme achieves linear speedup in the number of workers and the mini-batch size.
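The linear-speedup effect described above can be illustrated numerically: n workers each run H local steps on noisy gradients of a shared quadratic, then average, and the stationary error shrinks as n grows. All constants below (noise level, stepsize, round counts) are illustrative assumptions.

```python
# Toy sketch: local SGD with n workers; averaging over more workers
# cancels more gradient noise, shrinking the spread of the final model.
import random
import statistics

def local_sgd(n_workers, rounds=50, H=5, lr=0.05, seed=0):
    random.seed(seed)
    x = 0.0
    for _ in range(rounds):
        finals = []
        for _ in range(n_workers):
            xi = x  # each worker starts from the shared model
            for _ in range(H):
                noisy_grad = (xi - 2.0) + random.gauss(0.0, 1.0)  # optimum at x = 2
                xi -= lr * noisy_grad
            finals.append(xi)
        x = sum(finals) / n_workers  # communication round: average the models
    return x

# Spread of the final model around the optimum, over 30 independent runs.
spread_1 = statistics.pstdev(local_sgd(1, seed=s) - 2.0 for s in range(30))
spread_16 = statistics.pstdev(local_sgd(16, seed=s) - 2.0 for s in range(30))
# With 16 workers the spread is several times smaller than with 1 worker.
```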
• Computer Science
ICML
• 2019
Stochastic Gradient Push is studied, it is proved that SGP converges to a stationary point of smooth, non-convex objectives at the same sub-linear rate as SGD, and that all nodes achieve consensus.
• Computer Science
ICML
• 2020
It is proved in theory that Moniqua communicates a provably bounded number of bits per iteration, while converging at the same asymptotic rate as the original algorithm does with full-precision communication.
This paper proposes an asynchronous decentralized stochastic gradient descent algorithm (AD-PSGD) satisfying all of the above expectations; it is the first asynchronous algorithm that achieves a similar epoch-wise convergence rate to AllReduce-SGD at an over-100-GPU scale.
This paper studies a D-PSGD algorithm and provides the first theoretical analysis that indicates a regime in which decentralized algorithms might outperform centralized algorithms for distributed stochastic gradient descent.
• Computer Science
ArXiv
• 2018
This paper develops a framework of quantized, decentralized training and proposes two different strategies, called *extrapolation compression* and *difference compression*, which both converge at the rate of $O(1/\sqrt{nT})$, where n is the number of workers and T is the number of iterations, matching the convergence rate of full-precision, centralized training.
• Computer Science
Found. Trends Databases
• 2020
A brief introduction is provided to some recently developed distributed learning techniques, namely lossy communication compression (e.g., quantization and sparsification), asynchronous communication, and decentralized communication.
• Computer Science
ICML
• 2019
This work presents a novel gossip-based stochastic gradient descent algorithm, CHOCO-SGD, that converges at rate $\mathcal{O}\left(1/(nT) + 1/(T \delta^2 \omega)^2\right)$ for strongly convex objectives, where $T$ denotes the number of iterations and $\delta$ the eigengap of the connectivity matrix.
It is shown empirically that the method can reduce the amount of communication by three orders of magnitude while training a typical DNN for acoustic modelling, and enables efficient scaling to more parallel GPU nodes than any other method the authors are aware of.
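The difference-compression idea behind CHOCO-SGD-style schemes above (each node keeps a public replica of its model and gossips only a compressed difference) can be sketched as follows. The 0.1-grid rounding compressor, two-node topology, and consensus stepsize are assumptions for illustration, and the SGD term is omitted to isolate the compressed consensus step.

```python
# Toy sketch: gossip averaging where nodes exchange only compressed
# differences against public replicas, yet still reach approximate consensus.

def compress(v):
    """Crude compressor: round the difference onto a 0.1 grid (assumed scheme)."""
    return round(v, 1)

xs = [0.03, 9.81]   # local models; exact average is 4.92
hats = [0.0, 0.0]   # publicly known replicas of each model
gamma = 0.5         # consensus stepsize

for _ in range(100):
    # each node broadcasts a compressed difference to refresh its replica
    qs = [compress(x - h) for x, h in zip(xs, hats)]
    hats = [h + q for h, q in zip(hats, qs)]
    # consensus step: move toward the neighbor's public replica
    xs = [x + gamma * (hats[1 - i] - hats[i]) for i, x in enumerate(xs)]

# The symmetric update preserves the sum, so both models settle near the
# true average, up to the compressor's resolution.
```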