Corpus ID: 211572640

On Biased Compression for Distributed Learning

@article{Beznosikov2020OnBC,
  title={On Biased Compression for Distributed Learning},
  author={Aleksandr Beznosikov and Samuel Horvath and Peter Richt{\'a}rik and M. H. Safaryan},
  journal={ArXiv},
  year={2020},
  volume={abs/2002.12410}
}
In the last few years, various communication compression techniques have emerged as an indispensable tool helping to alleviate the communication bottleneck in distributed learning. However, despite the fact that biased compressors often show superior performance in practice when compared to the much more studied and understood unbiased compressors, very little is known about them. In this work we study three classes of biased compression operators, two of which are new, and their…
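To make the biased-versus-unbiased distinction concrete, here is a minimal numpy sketch contrasting a Top-K sparsifier (biased but contractive) with a Rand-K sparsifier rescaled to be unbiased. The function names and single-vector setup are illustrative assumptions, not code from the paper.

```python
# A minimal numpy sketch (not code from the paper) contrasting a biased,
# contractive Top-K sparsifier with a standard unbiased Rand-K sparsifier.
import numpy as np

def top_k(x: np.ndarray, k: int) -> np.ndarray:
    """Biased compressor: keep the k largest-magnitude entries, zero the rest.
    Contractive in the sense ||C(x) - x||^2 <= (1 - k/d) ||x||^2, but E[C(x)] != x."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def rand_k(x: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Unbiased compressor: keep k uniformly random entries, rescaled by d/k
    so that E[C(x)] = x, at the price of larger variance."""
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = x[idx] * (d / k)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    g = rng.standard_normal(10)
    print("top-2  :", top_k(g, 2))        # deterministic, keeps the largest entries
    print("rand-2 :", rand_k(g, 2, rng))  # random support, correct only in expectation
```

Top-K retains more of the vector's energy per coordinate sent, which is the practical advantage of biased compressors the abstract alludes to, while Rand-K buys unbiasedness with a d/k variance blow-up.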

Citations

On Communication Compression for Distributed Optimization on Heterogeneous Data
TLDR: The results indicate that D-EF-SGD is much less affected than D-QSGD by non-iid data, but both methods can suffer a slowdown if data skewness is high.
A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning
TLDR: This paper proposes a construction which can transform any contractive compressor into an induced unbiased compressor, and shows that this approach leads to vast improvements over EF, including reduced memory requirements, better communication complexity guarantees and fewer assumptions.
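As a rough illustration of the kind of construction summarized above (under the assumption that it composes a contractive compressor with an unbiased compressor applied to the leftover error; the paper's actual operator may differ in details), here is a short sketch. The names reuse the hypothetical top_k / rand_k functions from the earlier sketch.

```python
# Sketch: wrap a contractive (biased) compressor with any unbiased compressor
# applied to the compression error, so the composite is unbiased overall.

def induced_compressor(x, contractive, unbiased):
    cx = contractive(x)          # biased, contractive part (e.g. Top-K)
    err = x - cx                 # what the biased compressor threw away
    return cx + unbiased(err)    # unbiased overall, since E[unbiased(err)] = err

# Example (hypothetical) usage with the earlier sketches:
# rng = np.random.default_rng(1)
# compressed = induced_compressor(g, lambda v: top_k(v, 2), lambda v: rand_k(v, 2, rng))
```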
Shifted Compression Framework: Generalizations and Improvements
TLDR: This work develops a unified framework for studying lossy compression methods, which incorporates methods compressing both gradients and models, using unbiased and biased compressors, and sheds light on the construction of the auxiliary vectors.
Distributed Methods with Absolute Compression and Error Compensation
TLDR: The analysis of EC-SGD with absolute compression is generalized to the arbitrary sampling strategy, the resulting rates improve upon the previously known ones in this setting, and an analysis of EC-LSVRG with absolute compression for (strongly) convex problems is proposed.
Lower Bounds and Nearly Optimal Algorithms in Distributed Learning with Communication Compression
TLDR: A convergence lower bound is established for algorithms using unbiased or contractive compressors, in both unidirectional and bidirectional settings, and an algorithm, NEOLITHIC, is proposed which almost reaches the lower bound (up to logarithmic factors) under mild conditions.
Communication-Compressed Adaptive Gradient Method for Distributed Nonconvex Optimization
TLDR: It is proved that the proposed communication-efficient distributed adaptive gradient method converges to the first-order stationary point with the same iteration complexity as uncompressed vanilla AMSGrad in the stochastic nonconvex optimization setting.
Error Compensated Distributed SGD Can Be Accelerated
TLDR: This work proposes and studies the error compensated loopless Katyusha method, establishes an accelerated linear convergence rate under standard assumptions, and shows for the first time that error compensated gradient compression methods can be accelerated.
Optimal Gradient Compression for Distributed and Federated Learning
TLDR: This paper investigates the fundamental trade-off between the number of bits needed to encode compressed vectors and the compression error, and introduces an efficient compression operator, Sparse Dithering, which naturally achieves the lower bound.
Distributed Newton-Type Methods with Communication Compression and Bernoulli Aggregation
TLDR: This work proves that the recently developed class of three point compressors (3PC) of Richtárik et al. can be generalized to Hessian communication as well, and discovers several new 3PC mechanisms, such as adaptive thresholding and Bernoulli aggregation, which require reduced communication and occasional Hessian computations.
...

References

SHOWING 1-10 OF 51 REFERENCES
Natural Compression for Distributed Deep Learning
TLDR: This work introduces a new, simple yet theoretically and practically effective compression technique: natural compression (NC), which is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive) power of two, which can be computed in a "natural" way by ignoring the mantissa.
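A minimal sketch of randomized rounding to the nearest (signed) power of two, in the spirit of the summary above; the vectorized numpy implementation and its handling of zeros are assumptions for illustration, not the published operator's reference code.

```python
# Randomized rounding of each entry to a neighbouring power of two, with
# probabilities chosen so the rounding is unbiased: E[output] = input.
import numpy as np

def natural_compress(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    out = np.zeros_like(x, dtype=float)
    nz = x != 0                                      # zeros stay zero
    mag = np.abs(x[nz])
    a = np.floor(np.log2(mag))                       # exponent of the lower power of two
    lo, hi = 2.0 ** a, 2.0 ** (a + 1)
    p_up = (mag - lo) / lo                           # round up with this probability,
    rounded = np.where(rng.random(p_up.shape) < p_up, hi, lo)  # so E[rounded] = |x|
    out[nz] = np.sign(x[nz]) * rounded
    return out
```

Because each output magnitude is a power of two, only the sign and exponent need to be communicated, which is where the compression comes from.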
Stochastic Distributed Learning with Gradient Quantization and Variance Reduction
TLDR: These are the first methods that achieve linear convergence for arbitrary quantized updates in distributed optimization where the objective function is spread among different devices, each sending incremental model updates to a central server.
Global Momentum Compression for Sparse Communication in Distributed SGD
TLDR: This is the first work that proves the convergence of distributed momentum SGD (DMSGD) with sparse communication and memory gradient, and it theoretically proves the convergence rate of GMC for both convex and non-convex problems.
A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning
TLDR: This paper proposes a construction which can transform any contractive compressor into an induced unbiased compressor, and shows that this approach leads to vast improvements over EF, including reduced memory requirements, better communication complexity guarantees and fewer assumptions.
Sparsified SGD with Memory
TLDR: This work analyzes Stochastic Gradient Descent with k-sparsification or compression (for instance top-k or random-k) and shows that this scheme converges at the same rate as vanilla SGD when equipped with error compensation.
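For illustration, a hedged single-worker sketch of the memory/error-compensation idea summarized above; the names (ef_sgd_step, compress, memory) and the plain SGD update are assumptions, not the paper's algorithm verbatim.

```python
# One step of sparsified SGD with an error memory: only the compressed update
# is transmitted, and whatever the compressor drops is remembered and retried.

def ef_sgd_step(x, memory, grad, compress, lr):
    g = grad(x)                       # stochastic gradient at the current point
    corrected = memory + lr * g       # add back the error accumulated so far
    update = compress(corrected)      # only this part (e.g. Top-K) is transmitted
    memory = corrected - update       # remember what the compressor dropped
    return x - update, memory         # apply the compressed step
```

Because dropped mass is folded back into `memory` and resent in later iterations, the scheme can match the vanilla SGD rate, which is the claim in the summary above.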
3LC: Lightweight and Effective Traffic Compression for Distributed Machine Learning
TLDR: 3LC is presented, a lossy compression scheme for state change traffic that strikes a balance between multiple goals: traffic reduction, accuracy, computation overhead, and generality.
Stochastic Sign Descent Methods: New Algorithms and Better Theory
TLDR: A new sign-based method is proposed, Stochastic Sign Descent with Momentum (SSDM), which converges under the standard bounded variance assumption with the optimal asymptotic rate and is validated with numerical experiments.
Error Feedback Fixes SignSGD and other Gradient Compression Schemes
TLDR: It is proved that the algorithm EF-SGD with an arbitrary compression operator achieves the same rate of convergence as SGD without any additional assumptions, and thus EF-SGD achieves gradient compression for free.
Sparse Gradient Compression for Distributed SGD
TLDR: The experiments over sparse high-dimensional models and deep neural networks indicate that SGC can compress 99.99% of gradients in every iteration without performance degradation, and saves communication cost by up to 48×.
Distributed Learning with Compressed Gradient Differences
TLDR: This work proposes a new distributed learning method, DIANA, which resolves these issues via compression of gradient differences, performs a theoretical analysis in the strongly convex and nonconvex settings, and shows that its rates are superior to existing rates.
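A rough sketch of the gradient-difference idea summarized above, from a single worker's point of view; the names `quantize` and `alpha`, and the exact update rule for the shift, are illustrative assumptions rather than the paper's exact method. `quantize` stands for any unbiased compressor (e.g. the rand_k sketch earlier).

```python
# Compress the *difference* between the gradient and a locally maintained
# shift; as the shift tracks the gradient, the differences shrink and become
# cheaper to communicate accurately.

def diana_worker_step(g, h, quantize, alpha):
    delta_hat = quantize(g - h)   # compress the difference, not the raw gradient
    g_hat = h + delta_hat         # the server reconstructs this estimate of g
    h = h + alpha * delta_hat     # the shift drifts toward g over the iterations
    return g_hat, h
```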
...