# Tighter Theory for Local SGD on Identical and Heterogeneous Data

@inproceedings{Khaled2020TighterTF, title={Tighter Theory for Local SGD on Identical and Heterogeneous Data}, author={Ahmed Khaled and Konstantin Mishchenko and Peter Richt{\'a}rik}, booktitle={AISTATS}, year={2020} }

We provide a new analysis of local SGD, removing unnecessary assumptions and elaborating on the difference between two data regimes: identical and heterogeneous. In both cases, we improve the existing theory and provide values of the optimal stepsize and optimal number of local iterations. Our bounds are based on a new notion of variance that is specific to local SGD methods with different data. The tightness of our results is guaranteed by recovering known statements when we plug $H=1$, where…
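
As a rough illustration of the method the abstract refers to, here is a minimal sketch of local SGD on a toy problem. Everything in it is an assumption made for illustration and is not taken from the paper: the scalar objectives $f_m(x) = \frac{1}{2}(x - b_m)^2$, the function name `local_sgd`, its parameters, and the reading of $H$ as the number of local steps between communication rounds (`local_steps`). With `local_steps=1`, the workers average after every step.

```python
import random


def local_sgd(targets, rounds, local_steps, stepsize, noise_std=0.0, seed=0):
    """Toy local SGD: worker m minimizes f_m(x) = 0.5 * (x - targets[m])**2.

    Hypothetical helper for illustration only; not the paper's code.
    Differing values in `targets` stand in for heterogeneous data.
    """
    rng = random.Random(seed)
    x = 0.0  # shared model, synchronized at every communication round
    for _ in range(rounds):
        local_models = []
        for b in targets:
            x_m = x  # each worker starts from the shared model
            for _ in range(local_steps):  # local steps without communication
                # Stochastic gradient: exact gradient plus Gaussian noise.
                grad = (x_m - b) + rng.gauss(0.0, noise_std)
                x_m -= stepsize * grad
            local_models.append(x_m)
        # Communication step: average the local models.
        x = sum(local_models) / len(local_models)
    return x


targets = [1.0, 2.0, 6.0]  # minimizer of the average objective is their mean, 3.0

# Same total number of gradient steps per worker (200), different communication.
x_h1 = local_sgd(targets, rounds=200, local_steps=1, stepsize=0.1)
x_h5 = local_sgd(targets, rounds=40, local_steps=5, stepsize=0.1)
x_noisy = local_sgd(targets, rounds=200, local_steps=5, stepsize=0.05,
                    noise_std=0.1, seed=1)
```

In this toy problem all local objectives share the same curvature, so averaging commutes with the noise-free updates and both runs approach the same point. That is a property of the example only; it does not reproduce the effects of heterogeneous data that the paper analyzes.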

## 128 Citations

Statistical Estimation and Inference via Local SGD in Federated Learning

- Computer Science, Mathematics
- ArXiv
- 2021

The theoretical and empirical results show that the so-called Local SGD simultaneously achieves both statistical efficiency and communication efficiency.

Variance reduction in decentralized training over heterogeneous data

- 2021

Large-scale machine learning (ML) applications benefit from decentralized learning since it can execute parallel training and only needs to communicate with neighbors. However, compared to exact…

MIME: MIMICKING CENTRALIZED STOCHASTIC AL-

- 2020

Federated learning (FL) is a challenging setting for optimization due to the heterogeneity of the data across different clients. This heterogeneity has been shown to cause a client drift, which can…

Local SGD Optimizes Overparameterized Neural Networks in Polynomial Time

- Computer Science, Mathematics
- ArXiv
- 2021

It is proved that Local (S)GD (or FedAvg) can optimize two-layer neural networks with the Rectified Linear Unit (ReLU) activation function in polynomial time, a result that sheds light on the optimization theory of federated training of deep neural networks.

Federated Learning with Heterogeneous Data: A Superquantile Optimization Approach

- Computer Science, Mathematics
- ArXiv
- 2021

This work presents a stochastic training algorithm that interleaves differentially private client reweighting steps with federated averaging steps and is supported by finite-time convergence guarantees covering both convex and non-convex settings.

On the Outsized Importance of Learning Rates in Local Update Methods

- Computer Science, Mathematics
- ArXiv
- 2020

This work proves that, for quadratic objectives, local update methods perform stochastic gradient descent on a surrogate loss function, which it characterizes exactly, and uses this theory to derive novel convergence rates for federated averaging that showcase the trade-off between the condition number of the surrogate loss and its alignment with the true loss function.

Byzantine-Resilient High-Dimensional SGD with Local Iterations on Heterogeneous Data

- Computer Science, Mathematics
- ICML
- 2021

The authors believe theirs is the first Byzantine-resilient algorithm and analysis with local iterations in the presence of malicious/Byzantine clients, and they derive convergence results under minimal assumptions of bounded variance for SGD and bounded gradient dissimilarity in the statistically heterogeneous data setting.

Splitting Algorithms for Federated Learning

- 2021

Over the past few years, the federated learning (FL) community has witnessed a proliferation of new FL algorithms. However, our understanding of the theory of FL is still fragmented, and a thorough,…

Cooperative SGD: A Unified Framework for the Design and Analysis of Local-Update SGD Algorithms

- 2021

When training machine learning models using stochastic gradient descent (SGD) with a large number of nodes or massive edge devices, the communication cost of synchronizing gradients at every…

Communication-efficient SGD: From Local SGD to One-Shot Averaging

- Computer Science, Mathematics
- ArXiv
- 2021

A Local SGD scheme that communicates less overall by communicating less frequently as the number of iterations grows is suggested, and it is shown that Ω(N) communications are sufficient, and one-shot averaging which only uses a single round of communication can also achieve the optimal convergence rate asymptotically.

## References

Better Communication Complexity for Local SGD

- Computer Science
- ArXiv
- 2019

This work revisits the local Stochastic Gradient Descent (local SGD) method and proves new convergence rates, improving upon the known requirement of Stich (2018) of synchronization times in total, where $T$ is the total number of iterations, which helps to explain the empirical success of local SGD.

SGD: General Analysis and Improved Rates

- Computer Science, Mathematics
- ICML 2019
- 2019

This theorem describes the convergence of an infinite array of variants of SGD, each of which is associated with a specific probability law governing the data selection rule used to form mini-batches, and can determine the mini-batch size that optimizes the total complexity.

Local SGD with Periodic Averaging: Tighter Analysis and Adaptive Synchronization

- Computer Science, Mathematics
- NeurIPS
- 2019

This paper shows that for loss functions that satisfy the Polyak-Łojasiewicz condition, rounds of communication suffice to achieve a linear speedup, that is, an error of $O(1/pT)$, where $T$ is the total number of model updates at each worker.

First Analysis of Local GD on Heterogeneous Data

- Computer Science, Mathematics
- ArXiv
- 2019

It is shown that in a low accuracy regime, the local gradient descent method has the same communication complexity as gradient descent.

Local SGD Converges Fast and Communicates Little

- Computer Science, Mathematics
- ICLR
- 2019

This work proves concise convergence rates for local SGD on convex problems and shows that it converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients; that is, the scheme achieves linear speedup in the number of workers and mini-batch size.

Qsparse-Local-SGD: Distributed SGD With Quantization, Sparsification, and Local Computations

- Computer Science, Mathematics
- IEEE Journal on Selected Areas in Information Theory
- 2020

This paper proposes the Qsparse-local-SGD algorithm, which combines aggressive sparsification with quantization and local computation along with error compensation, by keeping track of the difference between the true and compressed gradients, and demonstrates that it converges at the same rate as vanilla distributed SGD for many important classes of sparsifiers and quantizers.

On the Convergence of Local Descent Methods in Federated Learning

- Computer Science, Mathematics
- ArXiv
- 2019

The obtained convergence rates are the sharpest known to date for local descent methods with periodic averaging for solving nonconvex federated optimization in both centralized and networked distributed optimization.

Communication trade-offs for Local-SGD with large step size

- Computer Science, Mathematics
- NeurIPS
- 2019

A non-asymptotic error analysis is proposed, which enables comparison to one-shot averaging, and it is shown that local-SGD reduces communication by a factor of $O\Big(\frac{\sqrt{T}}{P^{3/2}}\Big)$, with $T$ the total number of gradients and $P$ the number of machines.

Revisiting Stochastic Extragradient

- Mathematics, Computer Science
- AISTATS
- 2020

This work fixes a fundamental issue in the stochastic extragradient method by providing a new sampling strategy that is motivated by approximating implicit updates, and proves guarantees for solving variational inequality that go beyond existing settings.

Distributed Optimization for Over-Parameterized Learning

- Computer Science, Mathematics
- ArXiv
- 2019

It is shown that more local updating can reduce the overall communication, even for an infinite number of steps, where each node is free to update its local model to near-optimality before exchanging information.