# Local SGD: Unified Theory and New Efficient Methods

```
@inproceedings{Gorbunov2021LocalSU,
  title     = {Local SGD: Unified Theory and New Efficient Methods},
  author    = {Eduard A. Gorbunov and Filip Hanzely and Peter Richt{\'a}rik},
  booktitle = {AISTATS},
  year      = {2021}
}
```

This work was supported by the KAUST baseline research grant of P. Richtárik. Part of this work was done while E. Gorbunov was a research intern at KAUST. The research of E. Gorbunov was also partially supported by the Ministry of Science and Higher Education of the Russian Federation (Goszadaniye) 075-00337-20-03 and by RFBR, project number 19-31-51001.
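
The local SGD scheme that the paper analyzes can be sketched as follows. This is an illustrative toy in pure Python (the function name `local_sgd`, the synthetic Gaussian gradient noise, and all parameter values are illustrative assumptions, not the authors' implementation): each worker runs several SGD steps on its own copy of the iterate, and communication happens only when the copies are averaged.

```python
import random

def local_sgd(grad, x0, n_workers=4, rounds=20, local_steps=10, lr=0.1, seed=0):
    # Each worker runs `local_steps` SGD steps on its own copy of the
    # iterate; the copies are then averaged, which is the only
    # communication in the scheme.
    rng = random.Random(seed)
    x = list(x0)
    d = len(x)
    for _ in range(rounds):
        copies = []
        for _ in range(n_workers):
            y = list(x)
            for _ in range(local_steps):
                # true gradient plus small Gaussian noise models the
                # stochastic gradient oracle
                g = [gi + rng.gauss(0.0, 0.01) for gi in grad(y)]
                y = [yi - lr * gi for yi, gi in zip(y, g)]
            copies.append(y)
        # one communication round: average the workers' iterates
        x = [sum(c[j] for c in copies) / n_workers for j in range(d)]
    return x

# quadratic f(x) = ||x||^2 / 2 with gradient x; minimizer at the origin
x_out = local_sgd(lambda v: list(v), x0=[1.0, -2.0])
```

With `local_steps=1` this reduces to plain mini-batch (parallel) SGD; larger `local_steps` trades communication rounds for extra local computation, which is the regime the unified theory covers.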


#### 17 Citations

Basis Matters: Better Communication-Efficient Second Order Methods for Federated Learning

- Computer Science, Mathematics
- ArXiv
- 2021

This work designs a new Newton-type method (BL1), which reduces communication cost via both the BL technique and a bidirectional compression mechanism, and presents two alternative extensions to partial participation to accommodate federated learning applications.

Inexact Tensor Methods and Their Application to Stochastic Convex Optimization

- Mathematics
- 2020

We propose a general non-accelerated tensor method under inexact information on higher-order derivatives, analyze its convergence rate, and provide sufficient conditions for this method to have…

FedNL: Making Newton-Type Methods Applicable to Federated Learning

- Computer Science, Mathematics
- ArXiv
- 2021

This work proposes a family of Federated Newton Learn (FedNL) methods, a marked step toward making second-order methods applicable to FL, and proves local convergence rates that are independent of the condition number, the number of training data points, and the compression variance.

An Operator Splitting View of Federated Learning

- Computer Science
- ArXiv
- 2021

This analysis reveals the vital role played by the step size in FL algorithms and shows that many existing FL algorithms can be understood from an operator-splitting point of view, leading to a streamlined and economical way to accelerate FL algorithms without incurring any communication overhead.

An Accelerated Second-Order Method for Distributed Stochastic Optimization

- Mathematics
- 2021

We consider distributed stochastic optimization problems that are solved with a master/workers computation architecture. Statistical arguments allow one to exploit statistical similarity and approximate…

Secure Distributed Training at Scale

- Computer Science, Mathematics
- ArXiv
- 2021

This work proposes a novel protocol for secure (Byzantine-tolerant) decentralized training that emphasizes communication efficiency. A rigorous analysis of this protocol provides theoretical bounds on its resistance to Byzantine and Sybil attacks and shows that it incurs only marginal communication overhead.

Recent theoretical advances in decentralized distributed convex optimization

- Mathematics, Computer Science
- 2020

This paper focuses on how results in decentralized distributed convex optimization can be explained based on optimal algorithms for the non-distributed setup, and provides recent results that have not yet been published.

Reducing the Communication Cost of Federated Learning through Multistage Optimization

- Computer Science, Mathematics
- ArXiv
- 2021

A multistage optimization scheme that nearly matches the lower bound across all heterogeneity levels is proposed, and its practical utility is demonstrated in image classification tasks.

Efficient Algorithms for Federated Saddle Point Optimization

- Computer Science, Mathematics
- ArXiv
- 2021

This work designs an algorithm that harnesses the benefit of similarity among clients while recovering Minibatch Mirror-prox performance under arbitrary heterogeneity (up to log factors), giving the first federated minimax optimization algorithm to achieve this goal.

FedPAGE: A Fast Local Stochastic Gradient Method for Communication-Efficient Federated Learning

- Computer Science, Mathematics
- ArXiv
- 2021

This work proposes a new federated learning algorithm, FedPAGE, which further reduces communication complexity by using the recent optimal PAGE method (Li et al., 2021) in place of plain SGD in FedAvg.

#### References

Showing 1–10 of 64 references

One Method to Rule Them All: Variance Reduction for Data, Parameters and Many New Methods

- Computer Science, Mathematics
- ArXiv
- 2019

This work proposes a remarkably general variance-reduced method suitable for solving regularized empirical risk minimization problems with either a large number of training examples, a large model dimension, or both, and provides a single theorem establishing linear convergence of the method under smoothness and quasi-strong convexity assumptions.

SGD: General Analysis and Improved Rates

- Computer Science, Mathematics
- ICML
- 2019

A single theorem describes the convergence of an infinite array of SGD variants, each associated with a specific probability law governing the data-selection rule used to form mini-batches, and can be used to determine the mini-batch size that optimizes the total complexity.

Distributed Learning with Compressed Gradient Differences

- Computer Science, Mathematics
- ArXiv
- 2019

This work proposes a new distributed learning method, DIANA, which compresses gradient differences rather than the gradients themselves, and performs a theoretical analysis in the strongly convex and nonconvex settings, showing that its rates are superior to existing rates.
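
The gradient-difference compression idea can be illustrated with a toy sketch (pure Python; the names `rand_sparsify` and `diana_round` and all parameter values are illustrative assumptions, not the DIANA paper's implementation): each worker keeps a shift `h_i` tracking its gradient and sends only the compressed difference, so the compressed messages, and hence the compression error, shrink as the method converges.

```python
import random

def rand_sparsify(v, k, rng):
    # Unbiased random-k compressor: keep k coordinates, rescale by d/k.
    d = len(v)
    out = [0.0] * d
    for i in rng.sample(range(d), k):
        out[i] = v[i] * d / k
    return out

def diana_round(x, grads, hs, lr=0.2, alpha=0.5, k=1, rng=None):
    # One round of gradient-difference compression: worker i sends
    # C(g_i - h_i); worker and server both update the shift h_i, so the
    # transmitted differences vanish as h_i approaches g_i.
    rng = rng or random.Random(0)
    d, n = len(x), len(grads)
    g_hat = [0.0] * d
    new_hs = []
    for g, h in zip(grads, hs):
        delta = rand_sparsify([gi - hi for gi, hi in zip(g, h)], k, rng)
        new_hs.append([hi + alpha * di for hi, di in zip(h, delta)])
        for j in range(d):
            g_hat[j] += (h[j] + delta[j]) / n  # unbiased estimate of g_i
    x_new = [xj - lr * gj for xj, gj in zip(x, g_hat)]
    return x_new, new_hs

# two workers with quadratics f_i(x) = ||x - b_i||^2 / 2;
# the average minimizer is [0.5, 0.5]
rng = random.Random(1)
bs = [[1.0, 0.0], [0.0, 1.0]]
x, hs = [0.0, 0.0], [[0.0, 0.0], [0.0, 0.0]]
for _ in range(400):
    grads = [[xj - bj for xj, bj in zip(x, b)] for b in bs]
    x, hs = diana_round(x, grads, hs, rng=rng)
```

Because the shifts learn the workers' stationary gradients, the iterates converge to the exact optimum despite the compression, which is the point the abstract highlights.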

The Convergence of Sparsified Gradient Methods

- Computer Science, Mathematics
- NeurIPS
- 2018

It is proved that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees for data-parallel SGD, for both convex and non-convex smooth objectives.

Linearly Converging Error Compensated SGD

- Computer Science, Mathematics
- NeurIPS
- 2020

A unified analysis of variants of distributed SGD with arbitrary compression and delayed updates is proposed, along with a new method, EC-SGD-DIANA: the first distributed stochastic method with error feedback and variance reduction that converges asymptotically in expectation to the exact optimum with a constant learning rate.

Local SGD Converges Fast and Communicates Little

- Computer Science, Mathematics
- ICLR
- 2019

Concise convergence rates are proved for local SGD on convex problems, showing that it converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients; that is, the scheme achieves linear speedup in the number of workers and the mini-batch size.

A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent

- Mathematics, Computer Science
- AISTATS
- 2020

A unified analysis is introduced for a large family of variants of proximal stochastic gradient descent that have so far required different intuitions and convergence analyses, serve different applications, and were developed separately in various communities.

Tighter Theory for Local SGD on Identical and Heterogeneous Data

- Computer Science, Mathematics
- AISTATS
- 2020

We provide a new analysis of local SGD, removing unnecessary assumptions and elaborating on the difference between two data regimes: identical and heterogeneous. In both cases, we improve the…

Variance Reduced Stochastic Gradient Descent with Neighbors

- Computer Science, Mathematics
- NIPS
- 2015

This paper investigates algorithms that can exploit neighborhood structure in the training data to share and re-use information about past stochastic gradients across data points, which offers advantages in the transient optimization phase.

Communication-Efficient Distributed Optimization using an Approximate Newton-type Method

- Computer Science, Mathematics
- ICML
- 2014

A novel Newton-type method for distributed optimization is proposed; it is particularly well suited for stochastic optimization and learning problems and enjoys a linear rate of convergence that provably improves with the data size.