Corpus ID: 247594371

IntSGD: Adaptive Floatless Compression of Stochastic Gradients

@inproceedings{Mishchenko2021IntSGDAF,
  title={IntSGD: Adaptive Floatless Compression of Stochastic Gradients},
  author={Konstantin Mishchenko and Bokun Wang and D. Kovalev and Peter Richt{\'a}rik},
  year={2021}
}
We propose a family of adaptive integer compression operators for distributed Stochastic Gradient Descent (SGD) that do not communicate a single float. This is achieved by multiplying floating-point vectors with a number known to every device and then rounding to integers. In contrast to the prior work on integer compression for SwitchML by Sapio et al. (2021), our IntSGD method is provably convergent and computationally cheaper as it estimates the scaling of vectors adaptively. Our theory… 
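To make the mechanism concrete, here is a minimal NumPy sketch of the integer compression idea described in the abstract: divide by a scale alpha that every device already knows, round stochastically to integers, communicate only the integers, and rescale on receipt. The names (int_compress, int_decompress) and the fixed alpha are illustrative assumptions; IntSGD itself estimates the scale adaptively, and this is not the authors' implementation.

```python
import numpy as np

def int_compress(v, alpha, rng):
    """Encode a float vector as integers: divide by a shared scale alpha,
    then round stochastically so that E[alpha * q] = v (unbiased)."""
    scaled = v / alpha
    low = np.floor(scaled)
    round_up = rng.random(v.shape) < (scaled - low)
    return (low + round_up).astype(np.int64)

def int_decompress(q, alpha):
    """Decode by multiplying the received integers with the shared scale."""
    return alpha * q.astype(np.float64)

# Toy usage: workers send only integers, the server averages after decoding.
rng = np.random.default_rng(0)
grads = [rng.normal(size=5) for _ in range(4)]
alpha = 0.01  # hypothetical shared scale; IntSGD adapts this quantity
encoded = [int_compress(g, alpha, rng) for g in grads]
avg_grad = int_decompress(sum(encoded), alpha) / len(grads)
```

Stochastic rounding keeps the encoding unbiased in expectation, which is the property convergence analyses of such schemes typically rely on.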


References

Showing 1-10 of 31 references.
QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding
TLDR: Quantized SGD (QSGD) is proposed, a family of compression schemes for gradient updates that provides convergence guarantees, leads to significant reductions in end-to-end training time, and can be extended to stochastic variance-reduced techniques.
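As a rough illustration of this style of quantization, the sketch below stochastically rounds each coordinate to one of s levels of the vector's norm while keeping the sign; the function names are hypothetical and QSGD's actual integer coding of the result is omitted.

```python
import numpy as np

def qsgd_quantize(v, s, rng):
    """Stochastically quantize each coordinate to one of s levels of ||v|| (sketch).
    The triple (norm, signs, levels) is what would be communicated."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return norm, np.sign(v), np.zeros(v.shape, dtype=np.int64)
    scaled = s * np.abs(v) / norm
    low = np.floor(scaled)
    levels = low + (rng.random(v.shape) < (scaled - low))
    return norm, np.sign(v), levels.astype(np.int64)

def qsgd_dequantize(norm, signs, levels, s):
    """Unbiased reconstruction: norm * sign * level / s."""
    return norm * signs * levels / s
```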
Sparsified SGD with Memory
TLDR: This work analyzes Stochastic Gradient Descent with k-sparsification or compression (for instance top-k or random-k) and shows that this scheme converges at the same rate as vanilla SGD when equipped with error compensation.
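A minimal sketch of top-k sparsification with error compensation, using illustrative names: the residual that is not transmitted is kept locally and added back before the next compression.

```python
import numpy as np

def topk_with_memory(grad, memory, k, lr):
    """One step of top-k sparsification with error compensation (sketch).
    The part of the update not selected by top-k is kept in `memory`
    and added back before the next compression."""
    corrected = lr * grad + memory
    idx = np.argpartition(np.abs(corrected), -k)[-k:]  # k largest magnitudes
    sparse = np.zeros_like(corrected)
    sparse[idx] = corrected[idx]
    return sparse, corrected - sparse  # transmitted update, new memory
```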
Natural Compression for Distributed Deep Learning
TLDR: This work introduces a new, simple, yet theoretically and practically effective compression technique: natural compression (NC). NC is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive) power of two, which can be computed in a "natural" way by ignoring the mantissa.
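The rounding rule described in the summary can be sketched as follows; the function name is illustrative and edge cases such as infinities or denormals are ignored.

```python
import numpy as np

def natural_compression(v, rng):
    """Randomized rounding of each nonzero entry to a signed power of two (sketch).
    For 2**k <= |x| <= 2**(k+1), round up with probability (|x| - 2**k) / 2**k,
    which makes the rounding unbiased; zeros stay zero."""
    out = np.zeros_like(v)
    nz = v != 0
    mag = np.abs(v[nz])
    low = 2.0 ** np.floor(np.log2(mag))
    p_up = (mag - low) / low  # probability of rounding up to the next power of two
    out[nz] = np.sign(v[nz]) * np.where(rng.random(mag.shape) < p_up, 2.0 * low, low)
    return out
```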
Error Feedback Fixes SignSGD and other Gradient Compression Schemes
TLDR: It is proved that the algorithm EF-SGD with an arbitrary compression operator achieves the same rate of convergence as SGD without any additional assumptions, and thus EF-SGD achieves gradient compression for free.
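A minimal sketch of one error-feedback step, assuming an arbitrary compressor passed in as a callable; the names and the example sign compressor are illustrative.

```python
import numpy as np

def ef_sgd_step(x, grad, error, lr, compress):
    """One error-feedback step (sketch): compress lr*grad plus the accumulated
    error, apply the compressed update, and store the lost part back."""
    p = lr * grad + error
    delta = compress(p)          # any (possibly biased) compressor, e.g. sign or top-k
    return x - delta, p - delta  # new iterate, new error

# Illustrative usage with a scaled-sign compressor.
x, err = np.ones(4), np.zeros(4)
g = np.array([0.5, -0.2, 0.1, -0.4])
x, err = ef_sgd_step(x, g, err, lr=0.1,
                     compress=lambda p: np.sign(p) * np.mean(np.abs(p)))
```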
Distributed Learning with Compressed Gradient Differences
TLDR: This work proposes a new distributed learning method, DIANA, which resolves the convergence issues of compressing gradients directly by compressing gradient differences instead; a theoretical analysis in the strongly convex and nonconvex settings shows that its rates are superior to existing rates.
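The idea of compressing gradient differences can be sketched as below, assuming a local shift h maintained by each worker and an illustrative step size alpha for updating it; this is a sketch of the idea, not the paper's exact algorithm.

```python
import numpy as np

def diana_worker_step(grad, h, compress, alpha):
    """Worker-side sketch of compressing gradient differences: only m is sent.
    The local shift h drifts toward the gradient, so the differences (and the
    compression error) shrink as training converges."""
    m = compress(grad - h)
    return m, h + alpha * m  # message to the server, updated local shift
```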
SGD: General Analysis and Improved Rates
TLDR: A general theorem describes the convergence of an infinite array of variants of SGD, each associated with a specific probability law governing the data selection rule used to form mini-batches, and can be used to determine the mini-batch size that optimizes the total complexity.
signSGD: compressed optimisation for non-convex problems
TLDR: SignSGD can get the best of both worlds, compressed gradients and an SGD-level convergence rate, and the momentum counterpart of signSGD is able to match the accuracy and convergence speed of Adam on deep ImageNet models.
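A minimal sketch of sign-based compression with a majority vote across workers, using illustrative names; the momentum (Signum) variant mentioned in the summary is omitted.

```python
import numpy as np

def signsgd_majority_step(x, worker_grads, lr):
    """One signSGD step with majority vote (sketch): workers send only the signs
    of their gradients; the server applies the elementwise sign of the vote sum."""
    votes = sum(np.sign(g) for g in worker_grads)
    return x - lr * np.sign(votes)
```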
Distributed learning with compressed gradients
TLDR: A unified analysis framework for distributed gradient methods operating with stale and compressed gradients is presented, and non-asymptotic bounds on convergence rates and information exchange are derived for several optimization algorithms.
On Biased Compression for Distributed Learning
TLDR: It is shown for the first time that biased compressors can lead to linear convergence rates in both the single-node and distributed settings, and a new high-performing biased compressor, a combination of Top-k and natural dithering, is proposed, which in the authors' experiments outperforms all other compression techniques.
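The composition of a sparsifier with a quantizer can be sketched generically as below; a callable quantize stands in for natural dithering, so this only illustrates the idea of combining the two operators and is not the paper's exact compressor.

```python
import numpy as np

def topk_then_quantize(v, k, quantize):
    """Keep the k largest-magnitude entries and quantize only those (sketch).
    `quantize` stands in for natural dithering; the composition of a sparsifier
    and a quantizer is the point being illustrated."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = quantize(v[idx])
    return out
```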
On the Utility of Gradient Compression in Distributed Training Systems
TLDR: This work evaluates the efficacy of gradient compression methods, compares their scalability with optimized implementations of synchronous data-parallel SGD, and proposes a list of desirable properties that gradient compression methods should satisfy in order to provide a meaningful end-to-end speedup.