Corpus ID: 244948602

Asynchronous Decentralized SGD with Quantized and Local Updates

Giorgi Nadiradze, Amirmojtaba Sabour, Peter Davies, Shigang Li, Dan Alistarh
Neural Information Processing Systems

Decentralized optimization is emerging as a viable alternative for scalable distributed machine learning, but also introduces new challenges in terms of synchronization costs. To this end, several communication-reduction techniques, such as non-blocking communication, quantization, and local steps, have been explored in the decentralized setting. Due to the complexity of analyzing optimization in such a relaxed setting, this line of work often assumes global communication rounds, which require… 

Scaling the Wild: Decentralizing Hogwild!-style Shared-memory SGD

This paper proposes an algorithm incorporating a decentralized distributed-memory computing architecture, with each node itself running multiprocessing parallel shared-memory SGD, and proves that the method guarantees ergodic convergence rates for non-convex objectives.

Consensus Control for Decentralized Deep Learning

It is shown in theory that when the training consensus distance is lower than a critical quantity, decentralized training converges as fast as the centralized counterpart, and empirical insights allow the principled design of better decentralized training schemes that mitigate the performance drop.

Hydra: An Optimized Data System for Large Multi-Model Deep Learning [Information System Architectures]

A suite of techniques to optimize system efficiency holistically is proposed, including a highly general parameter-spilling design that enables large models to be trained even with a single GPU, a novel multi-query optimization scheme that blends model execution schedules efficiently and maximizes GPU utilization, and a double buffering idea to hide latency.

Asynchronous Decentralized Learning over Unreliable Wireless Networks

This work proposes an asynchronous decentralized stochastic gradient descent algorithm, robust to the inherent computation and communication failures occurring at the wireless network edge, theoretically analyzes its performance, and establishes a non-asymptotic convergence guarantee.

A Field Guide to Federated Optimization

This paper provides recommendations and guidelines on formulating, designing, evaluating and analyzing federated optimization algorithms through concrete examples and practical implementation, with a focus on conducting effective simulations to infer real-world performance.

QuAFL: Federated Averaging Can Be Both Asynchronous and Communication-Efficient

This work jointly addresses two of the main practical challenges when scaling federated optimization to large node counts: the need for tight synchronization between the central authority and individual computing nodes, and the large communication cost of transmissions between the central server and clients.

SWIFT: Rapid Decentralized Federated Learning via Wait-Free Model Communication

Theoretically, it is proved that SWIFT matches the gold-standard iteration convergence rate O(1/√T) of parallel stochastic gradient descent for convex and non-convex smooth optimization (total iterations T), and theoretical results are provided for IID and non-IID settings without any bounded-delay assumption for slow clients, which is required by other asynchronous decentralized FL algorithms.

Hybrid Decentralized Optimization: First- and Zeroth-Order Optimizers Can Be Jointly Leveraged For Faster Convergence

This work essentially shows that, under reasonable parameter settings, a hybrid decentralized optimization system can not only withstand noisier zeroth-order agents, but can even benefit from integrating such agents into the optimization process, rather than ignoring their information.

Topology-aware Generalization of Decentralized SGD

It is proved that the consensus model learned by D-SGD is O(m/N + 1/m + λ²)-stable in expectation in the non-convex non-smooth setting, which is non-vacuous even when λ is close to 1, in contrast to the vacuous bounds suggested by existing literature.

Asynchronous SGD Beats Minibatch SGD Under Arbitrary Delays

This work introduces a novel recursion based on “virtual iterates” and delay-adaptive stepsizes, which allow it to derive state-of-the-art guarantees for both convex and non-convex objectives.



A Unified Theory of Decentralized SGD with Changing Topology and Local Updates

This paper introduces a unified convergence analysis that covers a large variety of decentralized SGD methods which so far have required different intuitions, have different applications, and which have been developed separately in various communities.

Local SGD Converges Fast and Communicates Little

Concise convergence rates are proved for local SGD on convex problems, showing that it converges at the same rate as mini-batch SGD in terms of number of evaluated gradients; that is, the scheme achieves linear speedup in the number of workers and mini-batch size.

Stochastic Gradient Push for Distributed Deep Learning

Stochastic Gradient Push (SGP) is studied, and it is proved that SGP converges to a stationary point of smooth, non-convex objectives at the same sub-linear rate as SGD, and that all nodes achieve consensus.

Moniqua: Modulo Quantized Communication in Decentralized SGD

It is proved in theory that Moniqua communicates a provably bounded number of bits per iteration, while converging at the same asymptotic rate as the original algorithm does with full-precision communication.

Asynchronous Decentralized Parallel Stochastic Gradient Descent

This paper proposes an asynchronous decentralized stochastic gradient descent algorithm (AD-PSGD) satisfying all above expectations, which is the first asynchronous algorithm that achieves a similar epoch-wise convergence rate as AllReduce-SGD, at an over 100-GPU scale.

Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent

This paper studies a D-PSGD algorithm and provides the first theoretical analysis that indicates a regime in which decentralized algorithms might outperform centralized algorithms for distributed stochastic gradient descent.

Decentralization Meets Quantization

This paper develops a framework of quantized, decentralized training and proposes two different strategies, called extrapolation compression and difference compression, which both converge at the rate of $O(1/\sqrt{nT})$, where n is the number of workers and T is the number of iterations, matching the convergence rate of full-precision, centralized training.

Distributed Learning Systems with First-Order Methods

A brief introduction to some distributed learning techniques that have recently been developed is provided, namely lossy communication compression (e.g., quantization and sparsification), asynchronous communication, and decentralized communication.

Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication

This work presents a novel gossip-based stochastic gradient descent algorithm, CHOCO-SGD, that converges at rate $\mathcal{O}\left(1/(nT) + 1/(T \delta^2 \omega)^2\right)$ for strongly convex objectives, where $T$ denotes the number of iterations and $\delta$ the eigengap of the connectivity matrix.

Scalable distributed DNN training using commodity GPU cloud computing

It is shown empirically that the method can reduce the amount of communication by three orders of magnitude while training a typical DNN for acoustic modelling, and enables efficient scaling to more parallel GPU nodes than any other method the authors are aware of.