# MATCHA: Speeding Up Decentralized SGD via Matching Decomposition Sampling

@article{Wang2019MATCHASU, title={MATCHA: Speeding Up Decentralized SGD via Matching Decomposition Sampling}, author={Jianyu Wang and Anit Kumar Sahu and Zhouyi Yang and Gauri Joshi and Soummya Kar}, journal={2019 Sixth Indian Control Conference (ICC)}, year={2019}, pages={299-300} }

Decentralized stochastic gradient descent (SGD) is a promising approach to learn a machine learning model over a network of workers connected in an arbitrary topology. Although a densely-connected network topology can ensure faster convergence in terms of iterations, it incurs more communication time/delay per iteration, resulting in longer training time. In this paper, we propose a novel algorithm MATCHA to achieve a win-win in this error-runtime trade-off. MATCHA uses matching decomposition…
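The matching-decomposition idea can be sketched in a few lines of Python. The greedy decomposition and the single uniform activation probability below are illustrative simplifications; MATCHA itself optimizes per-matching activation probabilities under a communication-budget constraint:

```python
import random

def matching_decomposition(edges):
    """Greedily split an edge list into matchings: sets of edges in which
    no two edges share a node, so every pair in a matching can
    communicate in parallel without contention."""
    matchings = []
    for u, v in edges:
        for m in matchings:
            if all(u not in e and v not in e for e in m):
                m.append((u, v))
                break
        else:
            matchings.append([(u, v)])
    return matchings

def sample_active_topology(matchings, p):
    """Each iteration, activate each matching independently with
    probability p; only activated edges communicate this round, so the
    expected communication time is a p-fraction of the full graph's."""
    return [e for m in matchings if random.random() < p for e in m]

ring = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)]   # 5-node ring topology
matchings = matching_decomposition(ring)
print(matchings)   # [[(0, 1), (2, 3)], [(1, 2), (3, 4)], [(4, 0)]]
print(sample_active_topology(matchings, p=0.5))   # random subset per iteration
```

Because each matching consists of disjoint worker pairs, all activated edges can exchange models in parallel, which is what ties the expected per-iteration communication time to the budget p.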


## 56 Citations

Exploring the Error-Runtime Trade-off in Decentralized Optimization

- Computer Science · 2020 54th Asilomar Conference on Signals, Systems, and Computers
- 2020

Several variants of the MATCHA algorithm are proposed, and it is shown that MATCHA can work with many other activation schemes and decentralized computation tasks, making it a flexible framework for reducing communication delay for free in decentralized environments.

Exponential Graph is Provably Efficient for Decentralized Deep Training

- Computer Science, Mathematics · ArXiv
- 2021

This work proves that so-called exponential graphs, in which every node is connected to O(log(n)) neighbors (with n the total number of nodes), can lead to both fast communication and effective averaging simultaneously, and discovers that a sequence of log(n) one-peer exponential graphs can together achieve exact averaging.
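The exact-averaging property is easy to verify numerically. A minimal numpy sketch, assuming n is a power of two (the offsets 1, 2, 4, ... play the role of the sequence of one-peer exponential graphs):

```python
import numpy as np

def one_peer_exponential_averaging(x):
    """Average over a sequence of one-peer exponential graphs: in round k
    each node i pairs with node (i + 2**k) mod n and the two average.
    For n a power of two, log2(n) rounds yield the exact global mean."""
    x = np.asarray(x, dtype=float)
    n, offset = len(x), 1
    while offset < n:
        x = (x + x[(np.arange(n) + offset) % n]) / 2.0
        offset *= 2
    return x

vals = np.arange(8.0)                         # nodes hold 0..7, mean 3.5
print(one_peer_exponential_averaging(vals))   # [3.5 3.5 ... 3.5]
```

After round k every node holds the average of 2^(k+1) consecutive values, so log2(n) rounds cover all n nodes exactly.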

SlowMo: Improving Communication-Efficient Distributed SGD with Slow Momentum

- Computer Science, Mathematics · ICLR
- 2020

Inspired by the BMUF method, this work proposes a slow momentum (SlowMo) framework in which workers periodically synchronize and perform a momentum update after multiple iterations of a base optimization algorithm, and provides theoretical convergence guarantees showing that SlowMo converges to a stationary point of smooth non-convex losses.
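A minimal sketch of the SlowMo outer loop on a toy quadratic problem; the base optimizer, worker objectives, and hyperparameters below are illustrative assumptions rather than the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(0)
targets = rng.normal(size=(4, 3))      # each of 4 "workers" fits its own target

def base_optimizer(x, tau, target, lr=0.2):
    """Base phase: tau plain SGD steps on 0.5 * ||x - target||^2."""
    for _ in range(tau):
        x -= lr * (x - target)
    return x

def slowmo_outer_step(x, m, tau=5, slow_lr=1.0, slow_beta=0.7):
    """One SlowMo outer iteration: every worker runs the base optimizer
    from the shared point x, the results are averaged, and a slow
    momentum update treats the averaged displacement as a pseudo-gradient."""
    local = np.stack([base_optimizer(x.copy(), tau, t) for t in targets])
    m = slow_beta * m + (x - local.mean(axis=0))    # slow momentum buffer
    return x - slow_lr * m, m

x, m = np.zeros(3), np.zeros(3)
for _ in range(60):
    x, m = slowmo_outer_step(x, m)
print(np.allclose(x, targets.mean(axis=0), atol=1e-3))   # True: reaches the mean
```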

Communication-efficient SGD: From Local SGD to One-Shot Averaging

- Computer Science, Mathematics · ArXiv
- 2021

A Local SGD scheme is suggested that communicates less overall by communicating less frequently as the number of iterations grows; it is shown that Ω(N) communications are sufficient, and that one-shot averaging, which uses only a single round of communication, can also achieve the optimal convergence rate asymptotically.
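One way to picture an increasing-gap schedule is the sketch below; the quadratic spacing is an illustrative assumption, not the paper's exact schedule:

```python
def sync_points(total_iters):
    """Synchronization schedule with growing gaps: average the workers'
    models at iterations 1, 4, 9, 16, ..., so the total number of
    communication rounds is only about sqrt(total_iters), while all
    other iterations run purely local SGD steps."""
    return [j * j for j in range(1, int(total_iters ** 0.5) + 1)]

print(sync_points(100))   # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```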

Finite-Time Consensus Learning for Decentralized Optimization with Nonlinear Gossiping

- Computer Science · ArXiv
- 2021

A novel decentralized learning framework based on nonlinear gossiping (NGO) is presented, which enjoys an appealing finite-time consensus property that achieves better synchronization, and its merits for modern distributed optimization applications, such as deep neural networks, are discussed.

Communication Efficient Decentralized Training with Multiple Local Updates

- Mathematics, Computer Science · ArXiv
- 2019

This work analyzes the Periodic Decentralized Stochastic Gradient Descent (PD-SGD) algorithm, a straightforward combination of federated averaging and decentralized SGD, and proves that PD-SGD converges to a critical point.

Near-Optimal Topology-adaptive Parameter Synchronization in Distributed DNN Training

- Computer Science · IEEE INFOCOM 2021 - IEEE Conference on Computer Communications
- 2021

It is shown that the optimal parameter-synchronization topology should be composed of trees with different workers as roots, each aggregating or broadcasting a partition of the gradients/parameters, and a near-optimal forest packing is proposed to maximally utilize the available bandwidth and to overlap the aggregation and broadcast stages so as to minimize communication time.

Cooperative SGD: A Unified Framework for the Design and Analysis of Local-Update SGD Algorithms

- 2021

When training machine learning models using stochastic gradient descent (SGD) with a large number of nodes or massive edge devices, the communication cost of synchronizing gradients at every…

Communication-Efficient Federated Learning with Sketching

- Computer Science · ICML 2020
- 2020

This paper introduces a novel algorithm, called FedSketchedSGD, which compresses model updates using a Count Sketch, and then takes advantage of the mergeability of sketches to combine model updates from many workers.
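The mergeability in question comes from the linearity of the Count Sketch: sketches computed by different workers can simply be added. A minimal illustrative sketch (not the paper's FedSketchedSGD implementation; the sizes and seed below are arbitrary assumptions):

```python
import numpy as np

class CountSketch:
    """Minimal Count Sketch: a linear compression of a long vector,
    so sketches from many workers merge by elementwise addition."""
    def __init__(self, rows=5, cols=1000, dim=10000, seed=0):
        rng = np.random.default_rng(seed)            # shared seed across workers
        self.buckets = rng.integers(0, cols, size=(rows, dim))
        self.signs = rng.choice([-1, 1], size=(rows, dim))
        self.table = np.zeros((rows, cols))

    def accumulate(self, vec):
        """Hash every coordinate of vec into one bucket per row."""
        for r in range(self.table.shape[0]):
            np.add.at(self.table[r], self.buckets[r], self.signs[r] * vec)

    def merge(self, other):
        self.table += other.table                    # mergeability = linearity

    def estimate(self, idx):
        """Median-of-rows estimate of coordinate idx of the sketched sum."""
        rows = np.arange(self.table.shape[0])
        return np.median(self.signs[rows, idx]
                         * self.table[rows, self.buckets[rows, idx]])

g1 = np.zeros(10000); g1[42] = 5.0    # two workers' sparse model updates
g2 = np.zeros(10000); g2[42] = 3.0
s1, s2 = CountSketch(), CountSketch()
s1.accumulate(g1); s2.accumulate(g2)
s1.merge(s2)
print(s1.estimate(42))                # 8.0: heavy coordinate of the merged update
```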

Decentralized Deep Learning with Arbitrary Communication Compression

- Computer Science, Mathematics · ICLR
- 2020

The use of communication compression in the decentralized training context achieves linear speedup in the number of workers and supports higher compression than previous state-of-the-art methods.

## References

Showing 1-10 of 49 references.

Collaborative Deep Learning in Fixed Topology Networks

- Computer Science, Mathematics · NIPS
- 2017

This paper presents a new consensus-based distributed SGD (CDSGD) (and its momentum variant, CDMSGD) algorithm for collaborative deep learning over fixed topology networks that enables data parallelization as well as decentralized computation.

Local SGD Converges Fast and Communicates Little

- Computer Science, Mathematics · ICLR
- 2019

Concise convergence rates are proved for local SGD on convex problems, showing that it converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients; that is, the scheme achieves linear speedup in the number of workers and the mini-batch size.

Parallel Restarted SGD for Non-Convex Optimization with Faster Convergence and Less Communication

- Computer Science · ArXiv
- 2018

A thorough and rigorous theoretical study is presented of why model averaging can work as well as parallel mini-batch SGD with significantly less communication overhead.

Stochastic Gradient Push for Distributed Deep Learning

- Computer Science, Mathematics · ICML
- 2019

Stochastic Gradient Push (SGP) is studied; it is proved that SGP converges to a stationary point of smooth, non-convex objectives at the same sub-linear rate as SGD, and that all nodes achieve consensus.
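SGP builds on the classical push-sum protocol, whose consensus component can be sketched as follows (the directed ring and the round count are illustrative choices, not the paper's experimental setup):

```python
import numpy as np

def push_sum_average(x0, out_neighbors, rounds):
    """Push-sum gossip on a directed graph: each node splits its value x
    and weight w equally among its out-neighbors (plus itself); on any
    strongly connected graph the ratio x/w converges to the global mean."""
    n = len(x0)
    x, w = np.array(x0, dtype=float), np.ones(n)
    for _ in range(rounds):
        new_x, new_w = np.zeros(n), np.zeros(n)
        for i in range(n):
            targets = out_neighbors[i] + [i]        # send to neighbors and self
            share = 1.0 / len(targets)
            for j in targets:
                new_x[j] += share * x[i]
                new_w[j] += share * w[i]
        x, w = new_x, new_w
    return x / w

n = 6
ring = {i: [(i + 1) % n] for i in range(n)}          # directed ring topology
print(push_sum_average(list(range(n)), ring, rounds=200))   # ≈ 2.5 everywhere
```

The weight vector w corrects for the imbalance that directed (column-stochastic) mixing introduces, which is what lets push-sum work without doubly stochastic averaging.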

Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-Update SGD

- Computer Science, Mathematics · MLSys
- 2019

The main contribution is the design of AdaComm, an adaptive communication strategy that starts with infrequent averaging to save communication delay and improve convergence speed, and then increases the communication frequency in order to achieve a low error floor.

TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning

- Computer Science, Mathematics · NIPS
- 2017

This work mathematically proves the convergence of TernGrad under the assumption of a bound on gradients, and proposes layer-wise ternarizing and gradient clipping to improve its convergence.
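The ternarization step admits a short unbiasedness check; the sketch below follows the description above but is not the authors' implementation:

```python
import numpy as np

def ternarize(grad, rng):
    """TernGrad-style stochastic ternarization: each component becomes
    s * sign(g_i) with probability |g_i| / s and 0 otherwise, where
    s = max|g|, so only values in {-1, 0, +1} plus one scalar are sent.
    The quantizer is unbiased: E[ternarize(g)] = g."""
    s = np.abs(grad).max()
    if s == 0.0:
        return grad
    keep = rng.random(grad.shape) < np.abs(grad) / s
    return s * np.sign(grad) * keep

rng = np.random.default_rng(0)
g = rng.normal(size=5)
avg = np.mean([ternarize(g, rng) for _ in range(20000)], axis=0)
print(np.round(g, 2))     # original gradient
print(np.round(avg, 2))   # empirical mean of the quantizer ≈ g (unbiasedness)
```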

Sparse Binary Compression: Towards Distributed Deep Learning with minimal Communication

- Computer Science, Mathematics · 2019 International Joint Conference on Neural Networks (IJCNN)
- 2019

To mitigate the limited communication bandwidth between contributing nodes and the prohibitive communication cost of distributed training, SBC combines existing techniques of communication delay and gradient sparsification with a novel binarization method and optimal weight-update encoding, pushing compression gains to new limits.

Slow and Stale Gradients Can Win the Race

- Computer Science, Mathematics · IEEE Journal on Selected Areas in Information Theory
- 2021

This work presents a novel theoretical characterization of the speed-up offered by asynchronous SGD methods by analyzing the trade-off between the error in the trained model and the actual training runtime (wallclock time).

LAG: Lazily Aggregated Gradient for Communication-Efficient Distributed Learning

- Computer Science, Mathematics · NeurIPS
- 2018

A new class of gradient methods for distributed machine learning is presented that adaptively skips gradient calculations to learn with reduced communication and computation, justifying the acronym LAG (Lazily Aggregated Gradient).

Dual Averaging for Distributed Optimization: Convergence Analysis and Network Scaling

- Mathematics, Computer Science · IEEE Transactions on Automatic Control
- 2012

This work develops and analyzes distributed algorithms based on dual subgradient averaging, provides sharp bounds on their convergence rates as a function of the network size and topology, and shows that the number of iterations required by the algorithm scales inversely in the spectral gap of the network.
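The spectral-gap dependence is easy to see numerically; the ring mixing matrix below is a standard example, not one taken from the paper:

```python
import numpy as np

def spectral_gap(W):
    """Spectral gap 1 - |lambda_2| of a symmetric doubly stochastic
    mixing matrix W; gossip-style methods need on the order of 1 / gap
    more iterations as the gap shrinks, i.e. as the graph gets sparser."""
    moduli = np.sort(np.abs(np.linalg.eigvalsh(W)))[::-1]
    return 1.0 - moduli[1]

def ring_mixing(n):
    """Uniform mixing over an n-node ring: each node averages itself
    with its two neighbors."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, [i, (i - 1) % n, (i + 1) % n]] = 1.0 / 3.0
    return W

for n in (8, 16, 32):
    print(n, round(spectral_gap(ring_mixing(n)), 4))   # gap shrinks like 1/n^2
```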