Corpus ID: 226254755

Local SGD: Unified Theory and New Efficient Methods

@inproceedings{Gorbunov2021LocalSU,
  title={Local SGD: Unified Theory and New Efficient Methods},
  author={Eduard A. Gorbunov and Filip Hanzely and Peter Richt{\'a}rik},
  booktitle={AISTATS},
  year={2021}
}
This work was supported by the KAUST baseline research grant of P. Richtárik. Part of this work was done while E. Gorbunov was a research intern at KAUST. The research of E. Gorbunov was also partially supported by the Ministry of Science and Higher Education of the Russian Federation (Goszadaniye) 075-00337-20-03 and RFBR, project number 19-31-51001.

Citations

Basis Matters: Better Communication-Efficient Second Order Methods for Federated Learning
This work designs a new Newton-type method (BL1), which reduces communication cost via both the BL technique and a bidirectional compression mechanism, and presents two alternative extensions to partial participation to accommodate federated learning applications.
Inexact Tensor Methods and Their Application to Stochastic Convex Optimization
We propose a general non-accelerated tensor method under inexact information on higher-order derivatives, analyze its convergence rate, and provide sufficient conditions for this method to have …
FedNL: Making Newton-Type Methods Applicable to Federated Learning
This work proposes a family of Federated Newton Learn (FedNL) methods, a marked step toward making second-order methods applicable to FL, and proves local convergence rates that are independent of the condition number, the number of training data points, and the compression variance.
An Operator Splitting View of Federated Learning
This analysis shows that many existing FL algorithms can be understood from an operator splitting point of view, reveals the vital role played by the step size in FL algorithms, and leads to a streamlined and economical way to accelerate FL algorithms without incurring any communication overhead.
An Accelerated Second-Order Method for Distributed Stochastic Optimization
We consider distributed stochastic optimization problems that are solved with a master/workers computation architecture. Statistical arguments allow us to exploit statistical similarity and approximate …
Secure Distributed Training at Scale
This work proposes a novel protocol for secure (Byzantine-tolerant) decentralized training that emphasizes communication efficiency. A rigorous analysis of this protocol provides theoretical bounds on its resistance to Byzantine and Sybil attacks and shows that it incurs only marginal communication overhead.
Recent theoretical advances in decentralized distributed convex optimization.
This paper focuses on how the results of decentralized distributed convex optimization can be explained based on optimal algorithms for the non-distributed setup, and provides recent results that have not been published yet.
Reducing the Communication Cost of Federated Learning through Multistage Optimization
A multistage optimization scheme that nearly matches the lower bound across all heterogeneity levels is proposed, and its practical utility in image classification tasks is demonstrated.
Efficient Algorithms for Federated Saddle Point Optimization
This work designs an algorithm that can harness the benefit of similarity among the clients while recovering the Minibatch Mirror-Prox performance under arbitrary heterogeneity (up to log factors), giving the first federated minimax optimization algorithm that achieves this goal.
FedPAGE: A Fast Local Stochastic Gradient Method for Communication-Efficient Federated Learning
This work proposes a new federated learning algorithm, FedPAGE, which further reduces the communication complexity by using the recent optimal PAGE method (Li et al., 2021) in place of plain SGD in FedAvg.
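This entry builds on the PAGE estimator of Li et al. (2021). Below is a minimal single-machine sketch of that estimator only (not of FedPAGE itself); the finite-sum interface (`grad_all`, `grad_i`), the step size, and the switching probability `p` are assumptions made purely for illustration.

```python
import numpy as np

def page_sgd(grad_all, grad_i, n, x0, lr, p, steps, rng):
    """Sketch of a PAGE-style loop: with probability p recompute the full gradient
    over all n samples; otherwise reuse the previous estimator plus a single-sample
    gradient difference evaluated at the *same* index i (the variance-reduction step)."""
    x_old = x0.copy()
    g = grad_all(x_old)          # initialize the estimator with a full gradient
    x = x_old - lr * g
    for _ in range(steps):
        if rng.random() < p:
            g = grad_all(x)                            # occasional full refresh
        else:
            i = int(rng.integers(n))                   # one sample shared by both terms
            g = g + grad_i(x, i) - grad_i(x_old, i)    # recursive correction
        x_old, x = x, x - lr * g
    return x
```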

References

Showing 1-10 of 64 references
One Method to Rule Them All: Variance Reduction for Data, Parameters and Many New Methods
This work proposes a remarkably general variance-reduced method suitable for solving regularized empirical risk minimization problems with either a large number of training examples, or a large model dimension, or both, and provides a single theorem establishing linear convergence of the method under smoothness and quasi-strong convexity assumptions.
SGD: General Analysis and Improved Rates
The main theorem describes the convergence of an infinite array of variants of SGD, each associated with a specific probability law governing the data selection rule used to form mini-batches, and can be used to determine the mini-batch size that optimizes the total complexity.
Distributed Learning with Compressed Gradient Differences
This work proposes a new distributed learning method, DIANA, which resolves the convergence issues of naive gradient compression by compressing gradient differences instead, performs a theoretical analysis in the strongly convex and nonconvex settings, and shows that its rates are superior to existing rates.
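As a rough illustration of the gradient-difference compression idea, here is a sketch of one such round; it is not the authors' implementation, and the rand-k compressor, the step sizes gamma and alpha, and the function names are assumptions for this example only.

```python
import numpy as np

def rand_k(v, k, rng):
    """Unbiased random-k sparsifier: keep k random coordinates, rescale by d/k."""
    out = np.zeros_like(v)
    idx = rng.choice(v.size, size=k, replace=False)
    out[idx] = v[idx] * (v.size / k)
    return out

def diana_style_round(x, shifts, grads, gamma, alpha, k, rng):
    """One round in the spirit of DIANA: worker i compresses the *difference* between
    its gradient and its local shift h_i, so the transmitted messages shrink as the
    shifts track the gradients; the server averages the reconstructed gradients."""
    deltas = [rand_k(g - h, k, rng) for g, h in zip(grads, shifts)]
    g_hat = np.mean([h + d for h, d in zip(shifts, deltas)], axis=0)   # server estimate
    shifts = [h + alpha * d for h, d in zip(shifts, deltas)]           # shift update
    return x - gamma * g_hat, shifts
```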
The Convergence of Sparsified Gradient Methods
It is proved that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees for data-parallel SGD, for both convex and non-convex smooth objectives.
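A minimal sketch of magnitude sparsification with local error correction as described in this entry (the top-k compressor, step size, and variable names are illustrative assumptions, not the paper's exact algorithm):

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude coordinates of v and zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def ef_step(x, grad, error, lr, k):
    """One error-feedback step: add the residual left over from previous rounds,
    transmit only the top-k entries, and carry what was dropped to the next step."""
    corrected = lr * grad + error       # local error correction
    update = top_k(corrected, k)        # the sparse message actually communicated
    return x - update, corrected - update
```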
Linearly Converging Error Compensated SGD
This work proposes a unified analysis of variants of distributed SGD with arbitrary compression and delayed updates, as well as EC-SGD-DIANA, the first distributed stochastic method with error feedback and variance reduction that converges to the exact optimum asymptotically in expectation with a constant learning rate.
Local SGD Converges Fast and Communicates Little
This work proves concise convergence rates for local SGD on convex problems and shows that it converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients; that is, the scheme achieves linear speedup in the number of workers and the mini-batch size.
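Since this entry (and the surveyed paper itself) concerns the local SGD template, here is a minimal single-process simulation of its communication pattern; the shifted quadratic objectives, noise level, step size, and synchronization interval are illustrative assumptions only.

```python
import numpy as np

def local_sgd(grad_fns, x0, lr, local_steps, rounds, rng):
    """Simulate local SGD: each worker runs `local_steps` SGD steps on its own
    objective, then all workers synchronize by averaging their iterates."""
    M = len(grad_fns)
    x = np.tile(x0, (M, 1)).astype(float)                 # one iterate per worker
    for _ in range(rounds):
        for _ in range(local_steps):
            for m in range(M):
                x[m] -= lr * grad_fns[m](x[m], rng)       # local stochastic gradient step
        x[:] = x.mean(axis=0)                             # communication: average iterates
    return x[0]

# Example: workers minimize shifted quadratics f_m(x) = 0.5 * ||x - b_m||^2
# with additive gradient noise (all choices here are illustrative).
rng = np.random.default_rng(0)
b = [rng.normal(size=5) for _ in range(4)]
grads = [lambda x, rng, bm=bm: (x - bm) + 0.1 * rng.normal(size=x.shape) for bm in b]
x_out = local_sgd(grads, np.zeros(5), lr=0.1, local_steps=8, rounds=200, rng=rng)
```

Averaging only every `local_steps` iterations rather than every iteration is exactly the trade-off analyzed here: fewer communication rounds at the price of some drift between workers.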
A Unified Theory of SGD: Variance Reduction, Sampling, Quantization and Coordinate Descent
This work introduces a unified analysis of a large family of variants of proximal stochastic gradient descent which have so far required different intuitions and convergence analyses, have different applications, and have been developed separately in various communities.
Tighter Theory for Local SGD on Identical and Heterogeneous Data
We provide a new analysis of local SGD, removing unnecessary assumptions and elaborating on the difference between two data regimes: identical and heterogeneous. In both cases, we improve the …
Variance Reduced Stochastic Gradient Descent with Neighbors
This paper investigates algorithms that can exploit neighborhood structure in the training data to share and re-use information about past stochastic gradients across data points, which offers advantages in the transient optimization phase.
Communication-Efficient Distributed Optimization using an Approximate Newton-type Method
This work proposes a novel Newton-type method for distributed optimization that is particularly well suited for stochastic optimization and learning problems, and enjoys a linear rate of convergence that provably improves with the data size.