
EF21 with Bells & Whistles: Practical Algorithmic Extensions of Modern Error Feedback

@article{Fatkhullin2021EF21WB,
  title={EF21 with Bells \& Whistles: Practical Algorithmic Extensions of Modern Error Feedback},
  author={Ilyas Fatkhullin and Igor Sokolov and Eduard A. Gorbunov and Zhize Li and Peter Richt{\'a}rik},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.03294}
}
First proposed by Seide et al. (2014) as a heuristic, error feedback (EF) is a very popular mechanism for enforcing convergence of distributed gradient-based optimization methods enhanced with communication compression strategies based on the application of contractive compression operators. However, existing theory of EF relies on very strong assumptions (e.g., bounded gradients), and provides pessimistic convergence rates (e.g., while the best known rate for EF in the smooth nonconvex regime… 
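For readers unfamiliar with the mechanism, here is a minimal sketch of classic error feedback with a contractive Top-k compressor in the single-node setting; the function names and the SGD-style step are illustrative assumptions, not the paper's implementation.

import numpy as np

def top_k(v, k):
    # Contractive Top-k compressor for a 1-D vector v:
    # keep the k largest-magnitude entries, zero out the rest.
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef_step(x, e, grad_fn, lr, k):
    # Classic error feedback (in the spirit of Seide et al., 2014), single-node illustration:
    # compress the error-corrected update, apply only the compressed part,
    # and carry the untransmitted residual forward in the error buffer.
    p = e + lr * grad_fn(x)   # error-corrected proposed update
    c = top_k(p, k)           # compressed part that would be communicated
    return x - c, p - c       # new iterate, new error buffer

EF21, by contrast, has each worker i maintain a gradient estimate g_i updated as g_i <- g_i + C(grad_i(x) - g_i), with the server stepping along the average of the g_i; the sketch above shows only the classical mechanism the abstract refers to.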

EF-BV: A Unified Theory of Error Feedback and Variance Reduction Mechanisms for Biased and Unbiased Compression in Distributed Optimization

A general approach is developed that works with a new, larger class of compressors, parameterized by two quantities, the bias and the variance; this class includes unbiased and biased compressors as particular cases, and linear convergence is proved under certain conditions.
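For context (these are the standard definitions, not the paper's exact parameterization), the two classical compressor classes that such a unified class must subsume are the contractive and the unbiased compressors:

\[
\text{contractive: } \mathbb{E}\,\|\mathcal{C}(x)-x\|^2 \le (1-\alpha)\,\|x\|^2,\ \alpha\in(0,1];
\qquad
\text{unbiased: } \mathbb{E}[\mathcal{C}(x)]=x,\ \ \mathbb{E}\,\|\mathcal{C}(x)-x\|^2 \le \omega\,\|x\|^2.
\]

Top-k is contractive with \(\alpha = k/d\), while Rand-k and QSGD-style quantizers are unbiased; the summary above says the new class covers both regimes as particular cases.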

3PC: Three Point Compressors for Communication-Efficient Distributed Training and a Better Theory for Lazy Aggregation

We propose and study a new class of gradient communication mechanisms for communication-efficient training, called three point compressors (3PC), as well as efficient distributed nonconvex optimization algorithms.

EF21-P and Friends: Improved Theoretical Communication Complexity for Distributed Optimization with Bidirectional Compression

This work employs EF21-P as the mechanism for compressing and subsequently error-correcting the model broadcast by the server to the workers, and obtains novel methods supporting bidirectional compression and enjoying new state-of-the-art theoretical communication complexity for convex and nonconvex problems.
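As a rough sketch (assuming the standard EF21-style primal update, which this summary appears to describe), the workers maintain a shared approximation \(w^t\) of the server model \(x^t\), and the server broadcasts only a compressed correction:

\[
w^{t+1} = w^t + \mathcal{C}\!\left(x^{t+1} - w^t\right),
\]

so each worker can reconstruct \(w^{t+1}\) locally from the compressed message, and the approximation error is corrected over time rather than discarded.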

Detached Error Feedback for Distributed SGD with Random Sparsification

This work proposes a new detached error feedback (DEF) algorithm, which achieves a better convergence bound than error feedback for non-convex problems, and proposes DEF-A to accelerate generalization at the early stages of training, which enjoys better generalization bounds than DEF.

Communication Acceleration of Local Gradient Methods via an Accelerated Primal-Dual Algorithm with Inexact Prox

The general results yield new state-of-the-art rates for the class of strongly convex-concave saddle-point problems with bilinear coupling, characterized by the absence of smoothness in the dual function.

Lower Bounds and Nearly Optimal Algorithms in Distributed Learning with Communication Compression

A convergence lower bound is established for algorithms using either unbiased or contractive compressors, with unidirectional or bidirectional compression, and an algorithm, NEOLITHIC, is proposed which almost reaches the lower bound (up to logarithmic factors) under mild conditions.

Adaptive Compression for Communication-Efficient Distributed Training

AdaCGD is a theoretically grounded, multi-adaptive communication compression mechanism that extends the 3PC framework to bidirectional compression and provides sharp convergence bounds in the strongly convex, convex, and nonconvex settings.

Simple and Optimal Stochastic Gradient Methods for Nonsmooth Nonconvex Optimization

This work proposes and analyzes several stochastic gradient algorithms for finding stationary points or local minima in nonconvex finite-sum and online optimization problems, possibly with a nonsmooth regularizer, and proposes an optimal algorithm, called SSRGD, based on SARAH, which can find an ε-approximate (first-order) stationary point by simply adding some random perturbations.
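The SARAH-style recursive gradient estimator that such methods build on is (a standard formula, shown here for reference):

\[
v^t = \nabla f_{i_t}(x^t) - \nabla f_{i_t}(x^{t-1}) + v^{t-1},
\qquad x^{t+1} = x^t - \eta\, v^t,
\]

with \(v^0\) set to a full gradient at the start of each restart; SSRGD additionally injects random perturbations into the iterates, as the summary notes, a detail not reproduced here.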

BEER: Fast O(1/T) Rate for Decentralized Nonconvex Optimization with Communication Compression

This paper proposes BEER, which adopts communication compression with gradient tracking, and shows that it converges at a faster rate of O(1/T) than the state-of-the-art rate, matching the rate without compression even under arbitrary data heterogeneity.

On Biased Compression for Distributed Learning

It is shown for the first time that biased compressors can lead to linear convergence rates both in the single-node and distributed settings, and a new high-performing biased compressor, a combination of Top-k and natural dithering, is proposed, which in the authors' experiments outperforms all other compression techniques.

References

Showing 1-10 of 27 references

A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning

This paper proposes a construction which can transform any contractive compressor into an induced unbiased compressor, and shows that this approach leads to vast improvements over EF, including reduced memory requirements, better communication complexity guarantees and fewer assumptions.
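One construction of this flavor (a sketch under assumptions, not necessarily the paper's exact definition): given a contractive compressor \(\mathcal{C}\) and any unbiased compressor \(\mathcal{U}\), compress the residual with \(\mathcal{U}\):

\[
\mathcal{C}_{\mathrm{ind}}(x) = \mathcal{C}(x) + \mathcal{U}\!\left(x - \mathcal{C}(x)\right),
\]

which is unbiased because \(\mathbb{E}[\mathcal{U}(r)] = r\) for the residual \(r = x - \mathcal{C}(x)\), while the variance is controlled by the (typically small) norm of that residual.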

QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding

Quantized SGD is proposed, a family of compression schemes for gradient updates which provides convergence guarantees and leads to significant reductions in end-to-end training time, and can be extended to stochastic variance-reduced techniques.
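A minimal sketch of a QSGD-style stochastic quantizer with s levels follows; the function name and RNG handling are illustrative, and the efficient bit-level encoding that accompanies it in practice is omitted.

import numpy as np

def qsgd_quantize(v, s, rng=None):
    # QSGD-style quantization: encode each coordinate as norm * sign * (level / s),
    # rounding the level stochastically so that the quantizer is unbiased.
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    ratio = np.abs(v) / norm * s          # lies in [0, s]
    lower = np.floor(ratio)
    round_up = rng.random(v.shape) < (ratio - lower)
    return norm * np.sign(v) * (lower + round_up) / s

Unbiasedness holds coordinate-wise because the expected level equals ratio, so the expected output equals v.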

Sparsified SGD with Memory

This work analyzes Stochastic Gradient Descent with k-sparsification or compression (for instance top-k or random-k) and shows that this scheme converges at the same rate as vanilla SGD when equipped with error compensation.

DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression

This work provides a detailed analysis of this two-pass communication model and its asynchronous parallel variant, with error-compensated compression both on the worker nodes and on the parameter server. The resulting method has three very nice properties: it is compatible with an arbitrary compression technique, it admits an improved convergence rate, and it achieves linear speedup with respect to the number of workers.

Linearly Converging Error Compensated SGD

A unified analysis of variants of distributed SGD with arbitrary compression and delayed updates is proposed, along with EC-SGD-DIANA, the first distributed stochastic method with error feedback and variance reduction that converges to the exact optimum asymptotically in expectation with a constant learning rate.
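The DIANA-style shift mechanism that the variance-reduction part relies on is, in sketch form (standard DIANA updates; the error-feedback coupling of EC-SGD-DIANA itself is not reproduced here):

\[
\hat{g}_i^t = h_i^t + \mathcal{Q}\!\left(g_i^t - h_i^t\right),
\qquad
h_i^{t+1} = h_i^t + \alpha\,\mathcal{Q}\!\left(g_i^t - h_i^t\right),
\qquad
\hat{g}^t = \frac{1}{n}\sum_{i=1}^{n} \hat{g}_i^t,
\]

so each worker compresses only the difference to its local shift \(h_i^t\); as the shifts learn the local gradients at the optimum, the compressed messages shrink, which is what enables exact convergence with a constant learning rate.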

The Convergence of Sparsified Gradient Methods

It is proved that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees, for both convex and non-convex smooth objectives, for data-parallel SGD.

Distributed Second Order Methods with Fast Rates and Compressed Communication

Several new communication-efficient second-order methods for distributed optimization are developed, including a stochastic sparsification strategy for learning the unknown parameters iteratively in a communication-efficient manner, and a globalization strategy using cubic regularization.

CSER: Communication-efficient SGD with Error Reset

This work introduces a new technique called "error reset" that adapts arbitrary compressors for SGD, producing bifurcated local models with periodic reset of resulting local residual errors, and proves the convergence of CSER for smooth non-convex problems.

MARINA: Faster Non-Convex Distributed Learning with Compression

MARINA is a new communication-efficient method for non-convex distributed learning over heterogeneous datasets, based on a carefully designed biased gradient estimator, which is the key to its superior theoretical and practical performance.
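In sketch form, MARINA's estimator alternates randomly between a full synchronization and compressed gradient differences, roughly as follows:

\[
x^{t+1} = x^t - \gamma\, g^t,
\qquad
g^{t+1} =
\begin{cases}
\nabla f(x^{t+1}) & \text{with probability } p,\\
g^t + \dfrac{1}{n}\sum_{i=1}^{n} \mathcal{Q}_i\!\left(\nabla f_i(x^{t+1}) - \nabla f_i(x^{t})\right) & \text{with probability } 1-p.
\end{cases}
\]

Compressing differences of successive gradients, rather than the gradients themselves, is what makes the estimator biased yet low-variance.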

On Biased Compression for Distributed Learning

It is shown for the first time that biased compressors can lead to linear convergence rates both in the single-node and distributed settings, and a new high-performing biased compressor, a combination of Top-k and natural dithering, is proposed, which in the authors' experiments outperforms all other compression techniques.