MATCHA: Speeding Up Decentralized SGD via Matching Decomposition Sampling

@article{Wang2019MATCHASU,
  title={MATCHA: Speeding Up Decentralized SGD via Matching Decomposition Sampling},
  author={Jianyu Wang and Anit Kumar Sahu and Zhouyi Yang and Gauri Joshi and Soummya Kar},
  journal={2019 Sixth Indian Control Conference (ICC)},
  year={2019},
  pages={299-300}
}
Decentralized stochastic gradient descent (SGD) is a promising approach to learn a machine learning model over a network of workers connected in an arbitrary topology. Although a densely-connected network topology can ensure faster convergence in terms of iterations, it incurs more communication time/delay per iteration, resulting in longer training time. In this paper, we propose a novel algorithm MATCHA to achieve a win-win in this error-runtime trade-off. MATCHA uses matching decomposition… 
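To make the mechanism concrete, here is a minimal sketch of matching-decomposition sampling, assuming NetworkX for graph handling; the greedy decomposition, the per-matching activation probabilities, and the mixing parameter alpha are illustrative placeholders rather than the optimized quantities derived in the paper.

```python
# Illustrative sketch: decompose a topology into disjoint matchings and sample
# a random subset of them each iteration to build the mixing matrix.
import numpy as np
import networkx as nx

def matching_decomposition(graph):
    """Greedily split the edge set of `graph` into disjoint matchings."""
    remaining = graph.copy()
    matchings = []
    while remaining.number_of_edges() > 0:
        m = nx.maximal_matching(remaining)
        matchings.append(list(m))
        remaining.remove_edges_from(m)
    return matchings

def sample_mixing_matrix(n, matchings, probs, alpha=0.5, rng=np.random):
    """Activate each matching independently with its probability and return
    W = I - alpha * (Laplacian of the activated subgraph)."""
    laplacian = np.zeros((n, n))
    for matching, p in zip(matchings, probs):
        if rng.random() < p:                       # activate this matching
            for i, j in matching:
                laplacian[i, i] += 1; laplacian[j, j] += 1
                laplacian[i, j] -= 1; laplacian[j, i] -= 1
    return np.eye(n) - alpha * laplacian

# Example: a ring of 8 workers, each matching activated with probability 0.5.
G = nx.cycle_graph(8)
matchings = matching_decomposition(G)
W = sample_mixing_matrix(8, matchings, probs=[0.5] * len(matchings))
```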
Exploring the Error-Runtime Trade-off in Decentralized Optimization
TLDR
Several variants of the MATCHA algorithm are proposed, and it is shown that MATCHA can work with many other activation schemes and decentralized computation tasks, making it a flexible framework for reducing communication delay for free in decentralized environments.
Exponential Graph is Provably Efficient for Decentralized Deep Training
TLDR
This work proves that so-called exponential graphs, in which every node is connected to O(log(n)) neighbors (with n the total number of nodes), can lead to both fast communication and effective averaging simultaneously, and discovers that a sequence of log(n) one-peer exponential graphs can together achieve exact averaging.
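The finite-time averaging claim is easy to check numerically; the sketch below is an illustrative verification for n a power of two, not code from the paper.

```python
# Apply the one-peer exponential graphs with offsets 1, 2, 4, ... in sequence
# and verify that every worker ends up with the exact average.
import numpy as np

def one_peer_exponential_matrix(n, k):
    """Mixing matrix where node i averages with node (i + 2^k) mod n."""
    W = np.zeros((n, n))
    offset = 2 ** k
    for i in range(n):
        W[i, i] = 0.5
        W[i, (i + offset) % n] = 0.5
    return W

n = 8
x = np.random.randn(n)                      # one scalar parameter per worker
for k in range(int(np.log2(n))):            # log2(n) rounds
    x = one_peer_exponential_matrix(n, k) @ x
assert np.allclose(x, x.mean())             # every worker holds the exact average
```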
SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum
TLDR
Inspired by the BMUF method, this work proposes a slow momentum (SlowMo) framework in which workers periodically synchronize and perform a momentum update after multiple iterations of a base optimization algorithm, and provides theoretical convergence guarantees showing that SlowMo converges to a stationary point of smooth non-convex losses.
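A minimal single-process sketch of the outer SlowMo loop, assuming the base optimizer is plain local SGD with exact averaging; the hyperparameter names and values (lr, tau, alpha, beta) are illustrative. With beta = 0 and alpha = 1 the update reduces to ordinary periodic averaging.

```python
import numpy as np

def slowmo_round(x, worker_grads, lr=0.1, tau=5, alpha=1.0, beta=0.7, u=None):
    """One outer iteration: tau local SGD steps per worker, exact averaging,
    then a slow momentum step applied at the averaged model."""
    n = len(worker_grads)
    u = np.zeros_like(x) if u is None else u
    local = [x.copy() for _ in range(n)]
    for _ in range(tau):                       # inner loop: base optimizer
        for i in range(n):
            local[i] -= lr * worker_grads[i](local[i])
    x_avg = np.mean(local, axis=0)             # synchronization (averaging)
    u = beta * u + (x - x_avg) / lr            # slow momentum buffer
    return x - alpha * lr * u, u               # slow (outer) update
```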
Communication-efficient SGD: From Local SGD to One-Shot Averaging
TLDR
A Local SGD scheme is suggested that communicates less overall by communicating less frequently as the number of iterations grows; it is shown that Ω(N) communications are sufficient, and that one-shot averaging, which uses only a single round of communication, can also achieve the optimal convergence rate asymptotically.
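A minimal sketch of one-shot averaging, assuming per-worker stochastic gradient oracles (`grad_fns` is an illustrative placeholder); the only communication is the single averaging step at the end.

```python
import numpy as np

def one_shot_averaging(grad_fns, x0, lr=0.01, steps=1000):
    """Each worker runs SGD independently; a single round of communication
    at the very end averages the final iterates."""
    finals = []
    for grad in grad_fns:                 # runs in parallel in practice
        x = x0.copy()
        for _ in range(steps):
            x -= lr * grad(x)
        finals.append(x)
    return np.mean(finals, axis=0)        # the only communication round
```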
Finite-Time Consensus Learning for Decentralized Optimization with Nonlinear Gossiping
TLDR
A novel decentralized learning framework based on nonlinear gossiping (NGO), that enjoys an appealing finite-time consensus property to achieve better synchronization, is presented and its merits for modern distributed optimization applications, such as deep neural networks are discussed.
Communication Efficient Decentralized Training with Multiple Local Updates
TLDR
This work analyzes Periodic Decentralized Stochastic Gradient Descent (PD-SGD), a straightforward combination of federated averaging and decentralized SGD, and proves that PD-SGD converges to a critical point.
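A minimal sketch of the local-update-plus-gossip pattern analyzed here, assuming a fixed doubly stochastic mixing matrix W; tau, lr, and the round count are illustrative.

```python
import numpy as np

def pd_sgd(grad_fns, W, x0, lr=0.05, tau=4, rounds=50):
    """Each worker takes tau local SGD steps, then one gossip averaging step."""
    n = len(grad_fns)
    X = np.stack([x0.copy() for _ in range(n)])     # row i = worker i's model
    for _ in range(rounds):
        for _ in range(tau):                        # tau local SGD steps
            for i in range(n):
                X[i] -= lr * grad_fns[i](X[i])
        X = W @ X                                   # one decentralized averaging step
    return X.mean(axis=0)
```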
Near-Optimal Topology-adaptive Parameter Synchronization in Distributed DNN Training
  • Zhe Zhang, Chuan Wu, Zongpeng Li
  • IEEE INFOCOM 2021 - IEEE Conference on Computer Communications, 2021
TLDR
It is shown that the optimal parameter synchronization topology should be composed of trees with different workers as roots, each aggregating or broadcasting a partition of the gradients/parameters; a near-optimal forest-packing scheme is then proposed to maximally utilize the available bandwidth and to overlap the aggregation and broadcast stages, minimizing communication time.
Cooperative SGD: A Unified Framework for the Design and Analysis of Local-Update SGD Algorithms
When training machine learning models using stochastic gradient descent (SGD) with a large number of nodes or massive edge devices, the communication cost of synchronizing gradients at every iteration becomes a significant bottleneck…
Communication-Efficient Federated Learning with Sketching
TLDR
This paper introduces a novel algorithm, called FedSketchedSGD, which compresses model updates using a Count Sketch, and then takes advantage of the mergeability of sketches to combine model updates from many workers.
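A minimal Count Sketch sketch illustrating the mergeability the TLDR refers to: per-worker sketches built with the same hash seeds can be combined by entrywise addition before querying. The hash construction below is an illustrative choice, not the paper's implementation.

```python
import numpy as np

class CountSketch:
    def __init__(self, rows=5, cols=256, dim=10000, seed=0):
        rng = np.random.default_rng(seed)           # shared seed => mergeable sketches
        self.bucket = rng.integers(0, cols, size=(rows, dim))
        self.sign = rng.choice([-1.0, 1.0], size=(rows, dim))
        self.table = np.zeros((rows, cols))

    def add(self, vec):
        """Accumulate a length-`dim` update vector into the sketch."""
        for r in range(self.table.shape[0]):
            np.add.at(self.table[r], self.bucket[r], self.sign[r] * vec)

    def merge(self, other):
        """Combine another worker's sketch by entrywise addition."""
        self.table += other.table

    def query(self, idx):
        """Estimate coordinate `idx` of the summed updates (median of rows)."""
        rows = np.arange(self.table.shape[0])
        return np.median(self.sign[rows, idx] * self.table[rows, self.bucket[rows, idx]])
```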
Decentralized Deep Learning with Arbitrary Communication Compression
TLDR
The use of communication compression in the decentralized training context achieves linear speedup in the number of workers and supports higher compression than previous state-of-the-art methods.

References

Showing 1-10 of 49 references.
Collaborative Deep Learning in Fixed Topology Networks
TLDR
This paper presents a new consensus-based distributed SGD algorithm (CDSGD), along with its momentum variant (CDMSGD), for collaborative deep learning over fixed-topology networks that enables data parallelization as well as decentralized computation.
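A minimal sketch of one consensus-based decentralized SGD step, assuming a doubly stochastic mixing matrix W derived from the fixed topology; the step size is illustrative.

```python
import numpy as np

def cdsgd_step(X, grad_fns, W, lr=0.05):
    """X[i] is worker i's model; each worker mixes with its neighbors (W) and
    then takes a local stochastic gradient step."""
    grads = np.stack([grad_fns[i](X[i]) for i in range(len(grad_fns))])
    return W @ X - lr * grads
```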
Local SGD Converges Fast and Communicates Little
TLDR
Concise convergence rates are proved for local SGD on convex problems, showing that it converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients; that is, the scheme achieves linear speedup in the number of workers and the mini-batch size.
Parallel Restarted SGD for Non-Convex Optimization with Faster Convergence and Less Communication
TLDR
A thorough and rigorous theoretical study is provided of why model averaging can work as well as parallel mini-batch SGD with significantly less communication overhead.
Stochastic Gradient Push for Distributed Deep Learning
TLDR
Stochastic Gradient Push (SGP) is studied, and it is proved that SGP converges to a stationary point of smooth, non-convex objectives at the same sub-linear rate as SGD and that all nodes achieve consensus.
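A minimal sketch of one Stochastic Gradient Push step, assuming a column-stochastic matrix P over a directed graph and push-sum weights w initialized to ones; each node trains on the de-biased model z = x / w.

```python
import numpy as np

def sgp_step(X, w, grad_fns, P, lr=0.05):
    """One push-sum SGD step. X[i] is node i's parameter vector, w[i] its
    push-sum weight; P[i, j] is the fraction of node j's mass sent to node i."""
    Z = X / w[:, None]                              # de-biased local models
    grads = np.stack([g(z) for g, z in zip(grad_fns, Z)])
    X_half = X - lr * grads                         # local SGD step
    return P @ X_half, P @ w                        # push-sum mixing of (x, w)

# Weights are initialized once as w = np.ones(n) and updated every step.
```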
Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-Update SGD
TLDR
The main contribution is the design of AdaComm, an adaptive communication strategy that starts with infrequent averaging to save communication delay and improve convergence speed, and then increases the communication frequency in order to achieve a low error floor.
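A simplified sketch of an adaptive averaging period in the spirit of AdaComm: start with a large period to save communication, then shrink it as the training loss falls. The square-root rule below is an illustrative stand-in for the schedule derived in the paper.

```python
import math

def next_period(tau0, initial_loss, current_loss, tau_min=1):
    """Shrink the averaging period as the loss decreases (more frequent
    communication later in training to reach a lower error floor)."""
    tau = math.ceil(tau0 * math.sqrt(current_loss / initial_loss))
    return max(tau, tau_min)

# Example: the period shrinks from 16 toward 1 as the loss falls from 2.0 to 0.02.
print([next_period(16, 2.0, l) for l in (2.0, 1.0, 0.5, 0.1, 0.02)])
```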
TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning
TLDR
This work mathematically proves the convergence of TernGrad under the assumption of a bound on gradients, and proposes layer-wise ternarizing and gradient clipping to improve its convergence.
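A minimal sketch of ternary gradient quantization with optional clipping; the clipping threshold and the per-tensor scaling are illustrative simplifications of the layer-wise scheme described above.

```python
import numpy as np

def ternarize(g, clip_sigma=2.5, rng=np.random):
    """Stochastically quantize a gradient tensor to {-s, 0, +s}."""
    g = np.clip(g, -clip_sigma * g.std(), clip_sigma * g.std())  # gradient clipping
    s = np.max(np.abs(g))                            # per-tensor scale
    if s == 0:
        return g
    keep = rng.random(g.shape) < np.abs(g) / s       # Bernoulli(|g| / s)
    return s * np.sign(g) * keep                     # unbiased ternary estimate
```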
Sparse Binary Compression: Towards Distributed Deep Learning with minimal Communication
TLDR
SBC combines existing techniques of communication delay and gradient sparsification with a novel binarization method and optimal weight update encoding, pushing compression gains to new limits and thereby mitigating the limited communication bandwidth between contributing nodes and the prohibitive communication cost of distributed training.
Slow and Stale Gradients Can Win the Race
TLDR
This work presents a novel theoretical characterization of the speed-up offered by asynchronous SGD methods by analyzing the trade-off between the error in the trained model and the actual training runtime (wallclock time).
LAG: Lazily Aggregated Gradient for Communication-Efficient Distributed Learning
TLDR
A new class of gradient methods for distributed machine learning is presented that adaptively skips gradient calculations to learn with reduced communication and computation, justifying the acronym LAG (Lazily Aggregated Gradient) used henceforth.
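A simplified sketch of the lazy-aggregation rule: a worker communicates a fresh gradient only when it differs enough from the last gradient it sent. The fixed threshold here is a stand-in for the paper's condition, which compares the change against recent parameter movement.

```python
import numpy as np

class LazyWorker:
    def __init__(self, grad_fn, threshold=1e-3):
        self.grad_fn = grad_fn
        self.threshold = threshold
        self.last_sent = None

    def maybe_send(self, theta):
        """Return (gradient, communicated?) for the current parameters."""
        g = self.grad_fn(theta)
        if self.last_sent is None or np.linalg.norm(g - self.last_sent) ** 2 > self.threshold:
            self.last_sent = g                       # communicate a fresh gradient
            return g, True
        return self.last_sent, False                 # server reuses the stale gradient
```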
Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling
TLDR
This work develops and analyzes distributed algorithms based on dual subgradient averaging and provides sharp bounds on their convergence rates as a function of the network size and topology, showing that the number of iterations required by the algorithm scales inversely in the spectral gap of the network.
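Since the iteration count scales inversely with the spectral gap, a small helper for computing that gap makes the dependence concrete; the Metropolis-Hastings weights and the ring topology below are illustrative choices.

```python
import numpy as np
import networkx as nx

def metropolis_weights(graph):
    """Doubly stochastic gossip matrix from Metropolis-Hastings weights."""
    n = graph.number_of_nodes()
    W = np.zeros((n, n))
    deg = dict(graph.degree())
    for i, j in graph.edges():
        W[i, j] = W[j, i] = 1.0 / (1 + max(deg[i], deg[j]))
    np.fill_diagonal(W, 1.0 - W.sum(axis=1))
    return W

def spectral_gap(W):
    eigs = np.sort(np.abs(np.linalg.eigvals(W)))[::-1]
    return 1.0 - eigs[1]                             # 1 - second largest |eigenvalue|

print(spectral_gap(metropolis_weights(nx.cycle_graph(16))))   # small gap: slow mixing
```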