Corpus ID: 235377007

EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback

@inproceedings{Richtrik2021EF21AN,
  title={EF21: A New, Simpler, Theoretically Better, and Practically Faster Error Feedback},
  author={Peter Richt{\'a}rik and Igor Sokolov and Ilyas Fatkhullin},
  booktitle={NeurIPS},
  year={2021}
}
Error feedback (EF), also known as error compensation, is an immensely popular convergence stabilization mechanism in the context of distributed training of supervised machine learning models enhanced by the use of contractive communication compression mechanisms, such as Top-k. First proposed by Seide et al. [2014] as a heuristic, EF resisted any theoretical understanding until recently [Stich et al., 2018, Alistarh et al., 2018]. While these early breakthroughs were followed by a steady… 
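As a concrete illustration of the mechanism sketched in the abstract, the snippet below gives a minimal NumPy sketch of a Top-k contractive compressor together with an EF21-style step in which each worker compresses the difference between its fresh local gradient and its running gradient estimate. This is a sketch under illustrative naming (top_k, ef21_step, g_states), not the authors' code.

```python
import numpy as np

def top_k(v, k):
    """Contractive Top-k compressor: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def ef21_step(x, g_states, local_grads, lr=0.1, k=1):
    """One EF21-style iteration with n workers (illustrative sketch).

    x           : current model parameters (np.ndarray)
    g_states    : list of per-worker gradient estimates g_i
    local_grads : list of callables returning each worker's gradient at a point
    """
    # Server: take a gradient-type step along the average of the workers' estimates.
    x_new = x - lr * np.mean(g_states, axis=0)
    # Workers: compress the *difference* between the fresh local gradient and the
    # running estimate; only this compressed difference needs to be communicated.
    new_states = [g + top_k(grad(x_new) - g, k)
                  for g, grad in zip(g_states, local_grads)]
    return x_new, new_states

# Toy usage: two workers whose local quadratics have minimizers at +1 and -1.
local_grads = [lambda x: x - 1.0, lambda x: x + 1.0]
x = np.full(5, 3.0)
g_states = [np.zeros(5), np.zeros(5)]
for _ in range(200):
    x, g_states = ef21_step(x, g_states, local_grads, lr=0.5, k=2)
```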
EF-BV: A Unified Theory of Error Feedback and Variance Reduction Mechanisms for Biased and Unbiased Compression in Distributed Optimization
TLDR
The general approach works with a new, larger class of compressors, which includes unbiased and biased compressors as particular cases, and has two parameters, the bias and the variance.
3PC: Three Point Compressors for Communication-Efficient Distributed Training and a Better Theory for Lazy Aggregation
We propose and study a new class of gradient communication mechanisms for communication-efficient training—three point compressors (3PC)—as well as efficient distributed nonconvex optimization…
Detached Error Feedback for Distributed SGD with Random Sparsification
TLDR
This work proposes a new detached error feedback (DEF) algorithm, which enjoys a better convergence bound than error feedback for non-convex problems, and proposes DEF-A to accelerate the generalization of DEF at the early stages of training, with a better generalization bound than DEF.
Distributed Methods with Absolute Compression and Error Compensation
TLDR
The analysis of EC-SGD with absolute compression is generalized to the arbitrary sampling strategy, the resulting rates improve upon the previously known ones in this setting, and an analysis of EC-LSVRG with absolute compression for (strongly) convex problems is proposed.
Variance Reduction is an Antidote to Byzantines: Better Rates, Weaker Assumptions and Communication Compression as a Cherry on the Top
TLDR
Theoretical convergence guarantees are derived for Byz-VR-MARINA that outperform the previous state of the art for general non-convex and Polyak-Łojasiewicz loss functions, together with the first analysis of a Byzantine-tolerant method supporting non-uniform sampling of stochastic gradients.
Stochastic Gradient Descent-Ascent: Unified Theory and New Efficient Methods
TLDR
A unified convergence analysis is proposed that covers a large variety of stochastic gradient descent-ascent methods, which so far have required different intuitions, have different applications, and have been developed separately in various communities.
Differentially Quantized Gradient Methods
TLDR
The principle of differential quantization is introduced, which prescribes compensating the past quantization errors to direct the descent trajectory of a quantized algorithm towards that of its unquantized counterpart; the differentially quantized heavy-ball method attains the optimal contraction achievable among all (even unquantized) gradient methods.
On Biased Compression for Distributed Learning
TLDR
It is shown for the first time that biased compressors can lead to linear convergence rates both in the single node and distributed settings, and a new highly performing biased compressor is proposed---combination of Top-k and natural dithering---which in the authors' experiments outperforms all other compression techniques.
CANITA: Faster Rates for Distributed Convex Optimization with Communication Compression
TLDR
The results show that, as long as the number of devices n is large or the compression ω is not very high, CANITA achieves a faster convergence rate that improves upon the state-of-the-art non-accelerated rate.
Lower Bounds and Nearly Optimal Algorithms in Distributed Learning with Communication Compression
TLDR
A convergence lower bound is established for algorithms using unbiased or contractive compressors, in either unidirectional or bidirectional settings, and an algorithm, NEOLITHIC, is proposed that almost reaches the lower bound (up to logarithmic factors) under mild conditions.
...

References

Showing 1-10 of 42 references
Distributed Learning with Compressed Gradient Differences
TLDR
This work proposes a new distributed learning method, DIANA, which resolves these issues via compression of gradient differences, and performs a theoretical analysis in the strongly convex and nonconvex settings, showing that its rates are superior to existing rates.
Error Feedback Fixes SignSGD and other Gradient Compression Schemes
TLDR
It is proved that the algorithm EF-SGD with an arbitrary compression operator achieves the same rate of convergence as SGD without any additional assumptions, and thus EF-SGD achieves gradient compression for free.
Error Compensated Distributed SGD Can Be Accelerated
TLDR
This work proposes and studies the error compensated loopless Katyusha method, and establishes an accelerated linear convergence rate under standard assumptions, and shows for the first time that error compensated gradient compression methods can be accelerated.
The Error-Feedback Framework: Better Rates for SGD with Delayed Gradients and Compressed Communication
TLDR
These results show that SGD is robust to compressed and/or delayed stochastic gradient updates, which is particularly important for distributed parallel implementations, where asynchronous and communication-efficient methods are key to achieving linear speedups for optimization with multiple devices.
Sparsified SGD with Memory
TLDR
This work analyzes Stochastic Gradient Descent with k-sparsification or compression (for instance top-k or random-k) and shows that this scheme converges at the same rate as vanilla SGD when equipped with error compensation.
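As a concrete illustration of the error-compensation (memory) mechanism summarized above, here is a minimal single-worker NumPy sketch in which the part of the update discarded by Top-k sparsification is stored and re-added at the next step; names (mem_sgd_step, memory) are illustrative and this is not the paper's code.

```python
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude coordinates of v, zero the rest."""
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]
    out[idx] = v[idx]
    return out

def mem_sgd_step(x, memory, grad, lr=0.1, k=1):
    """One step of SGD with Top-k sparsification and error memory (sketch).

    The part of the update removed by sparsification is kept in `memory`
    and added back at the next step, so nothing is permanently discarded.
    """
    p = lr * grad + memory      # proposed dense update, corrected by the stored residual
    update = top_k(p, k)        # only this sparse vector is applied / transmitted
    memory = p - update         # remember what was dropped for the next iteration
    return x - update, memory
```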
Convex Optimization using Sparsified Stochastic Gradient Descent with Memory
TLDR
A sparsification scheme for SGD is proposed where only a small constant number of coordinates are applied at each iteration; it outperforms QSGD in progress per number of bits sent and opens the path to using lock-free asynchronous parallelization on dense problems.
The Convergence of Sparsified Gradient Methods
TLDR
It is proved that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees, for both convex and non-convex smooth objectives, for data-parallel SGD.
QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding
TLDR
Quantized SGD is proposed, a family of compression schemes for gradient updates which provides convergence guarantees and leads to significant reductions in end-to-end training time, and can be extended to stochastic variance-reduced techniques.
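The sketch below shows the kind of unbiased stochastic quantization QSGD is built on: each coordinate, after normalization, is randomly rounded to one of s uniform levels so that the quantized vector equals the input in expectation. The function name and the default number of levels are illustrative, and the paper's efficient bit encoding of the result is not shown.

```python
import numpy as np

def qsgd_quantize(v, s=4, rng=None):
    """Unbiased stochastic quantization onto s levels per sign (QSGD-style sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    scaled = np.abs(v) / norm * s            # position of each entry in [0, s]
    lower = np.floor(scaled)                 # quantization level from below
    prob_up = scaled - lower                 # round up with this probability
    levels = lower + (rng.random(v.shape) < prob_up)
    return norm * np.sign(v) * levels / s    # equals v in expectation
```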
Natural Compression for Distributed Deep Learning
TLDR
This work introduces a new, simple, yet theoretically and practically effective compression technique: natural compression (NC), which is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive) power of two, which can be computed in a "natural" way by ignoring the mantissa.
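To make the rounding rule above concrete, the sketch below performs unbiased randomized rounding of each coordinate to a neighbouring power of two. It works at the level of floating-point values; the mantissa-dropping bit trick described in the paper is not reproduced here, and the function name is illustrative.

```python
import numpy as np

def natural_compression(v, rng=None):
    """Round each entry of v to a neighbouring power of two, unbiasedly (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    v = np.asarray(v, dtype=float)
    out = np.zeros_like(v)
    nz = v != 0
    mag = np.abs(v[nz])
    low = 2.0 ** np.floor(np.log2(mag))      # lower neighbouring power of two
    p_up = mag / low - 1.0                   # probability of rounding up to 2 * low
    out[nz] = np.sign(v[nz]) * low * (1.0 + (rng.random(mag.shape) < p_up))
    return out
```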
Linearly Converging Error Compensated SGD
TLDR
A unified analysis of variants of distributed SGD with arbitrary compression and delayed updates is proposed, along with a method called EC-SGD-DIANA, the first distributed stochastic method with error feedback and variance reduction that converges to the exact optimum asymptotically in expectation with a constant learning rate.
...