Corpus ID: 59316742

Distributed Learning with Compressed Gradient Differences

@article{Mishchenko2019DistributedLW,
  title={Distributed Learning with Compressed Gradient Differences},
  author={Konstantin Mishchenko and Eduard A. Gorbunov and Martin Tak{\'a}{\v{c}} and Peter Richt{\'a}rik},
  journal={ArXiv},
  year={2019},
  volume={abs/1901.09269}
}
Training large machine learning models requires a distributed computing approach, with communication of the model updates being the bottleneck. For this reason, several methods based on the compression (e.g., sparsification and/or quantization) of updates were recently proposed, including QSGD (Alistarh et al., 2017), TernGrad (Wen et al., 2017), SignSGD (Bernstein et al., 2018), and DQGD (Khirirat et al., 2018). However, none of these methods are able to learn the gradients, which renders them… 
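As a rough illustration of the title idea (compressing the difference between the current gradient and a locally maintained reference, so the compressed messages can "learn" the gradients over time), here is a minimal numpy sketch. The sparsifier, step sizes, and function names below are illustrative choices, not the paper's exact operators:

```python
import numpy as np

def random_sparsify(v, k, rng):
    """Unbiased random-k sparsification: keep k coordinates, rescale by d/k."""
    d = v.size
    mask = np.zeros(d)
    idx = rng.choice(d, size=k, replace=False)  # k must not exceed d
    mask[idx] = d / k
    return v * mask

def gradient_difference_step(grads, h, x, lr=0.1, alpha=0.05, k=10, rng=None):
    """One step of gradient-difference compression (sketch).

    grads : list of local gradients g_i at the current point x
    h     : list of per-worker reference vectors h_i (same shapes)
    Each worker sends only a compressed version of g_i - h_i.
    """
    rng = rng or np.random.default_rng(0)
    n = len(grads)
    deltas = [random_sparsify(g - hi, k, rng) for g, hi in zip(grads, h)]
    # Server reconstructs an estimate of the average gradient.
    g_hat = sum(hi + d for hi, d in zip(h, deltas)) / n
    # Both sides update the references so future differences shrink.
    h = [hi + alpha * d for hi, d in zip(h, deltas)]
    x = x - lr * g_hat
    return x, h
```

Near a stationary point the differences g_i - h_i shrink, so less and less information is lost to compression as training progresses.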
Stochastic Sign Descent Methods: New Algorithms and Better Theory
TLDR
A new sign-based method is proposed, Stochastic Sign Descent with Momentum (SSDM), which converges under the standard bounded variance assumption with the optimal asymptotic rate and is validated with numerical experiments.
Quantization for Distributed Optimization
TLDR
A set of all-reduce compatible gradient compression schemes which significantly reduce the communication overhead while maintaining the performance of vanilla SGD are presented.
IntML: Natural Compression for Distributed Deep Learning
TLDR
This work introduces a new, remarkably simple yet theoretically and practically effective compression technique, called natural compression (Cnat), which is applied individually to all gradient values and works by randomized rounding to the nearest power of two.
Detached Error Feedback for Distributed SGD with Random Sparsification
TLDR
This work proposes a new detached error feedback (DEF) algorithm, which enjoys a better convergence bound than error feedback for non-convex problems, and proposes DEF-A to accelerate generalization at the early stages of training, with better generalization bounds than DEF.
Natural Compression for Distributed Deep Learning
TLDR
This work introduces a new, simple yet theoretically and practically effective compression technique, natural compression (NC), which is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive) power of two, which can be computed in a "natural" way by ignoring the mantissa.
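As a concrete illustration of the rounding rule described above, a minimal numpy sketch of unbiased randomized rounding to the nearest power of two (a generic re-implementation of the idea, not the authors' code):

```python
import numpy as np

def natural_compression(v, rng=None):
    """Randomized rounding of each entry to the nearest power of two (sketch).

    For t with 2**a <= |t| <= 2**(a+1), round |t| up to 2**(a+1) with
    probability (|t| - 2**a) / 2**a, down to 2**a otherwise; the sign is kept.
    The estimator is unbiased: E[C(t)] = t.
    """
    rng = rng or np.random.default_rng(0)
    out = np.zeros_like(v, dtype=float)
    nz = v != 0
    t = np.abs(v[nz])
    a = np.floor(np.log2(t))          # exponent of the lower power of two
    low = 2.0 ** a
    p_up = (t - low) / low            # probability of rounding up
    up = rng.random(t.shape) < p_up
    out[nz] = np.sign(v[nz]) * np.where(up, 2.0 * low, low)
    return out
```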
Shifted Compression Framework: Generalizations and Improvements
TLDR
This work develops a unified framework for studying lossy compression methods, which incorporates methods compressing both gradients and models, using unbiased and biased compressors, and sheds light on the construction of the auxiliary vectors.
Unbiased Single-scale and Multi-scale Quantizers for Distributed Optimization
TLDR
This paper presents a set of all-reduce compatible gradient compression schemes which significantly reduce the communication overhead while maintaining the performance of vanilla SGD.
Peering Beyond the Gradient Veil with Distributed Auto Differentiation
TLDR
This work introduces an innovative, communication-friendly approach for training distributed DNNs, which capitalizes on the outer-product structure of the gradient as revealed by the mechanics of auto-differentiation, and demonstrates that dAD trains more efficiently than other state-of-the-art distributed methods on modern architectures, such as transformers, when applied to large-scale text and imaging datasets.
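A small numpy example of the outer-product structure the summary refers to: for a linear layer, the weight gradient factors into the backpropagated error and the activation, so sending the two factors is cheaper than sending the full matrix (shapes below are arbitrary):

```python
import numpy as np

# For a linear layer y = W @ x, backprop gives dL/dW = delta @ x.T,
# i.e. an outer product of the backpropagated error and the activation.
rng = np.random.default_rng(0)
x = rng.standard_normal((512, 1))      # activations (input to the layer)
delta = rng.standard_normal((256, 1))  # gradient w.r.t. the layer output

grad_W = delta @ x.T                   # full gradient: 256 x 512 numbers
factors = (delta, x)                   # the two factors: 256 + 512 numbers

# Communicating the factors instead of grad_W is what makes the approach
# communication friendly; the receiver recovers grad_W exactly.
assert np.allclose(factors[0] @ factors[1].T, grad_W)
```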
Distributed Methods with Absolute Compression and Error Compensation
TLDR
The analysis of EC-SGD with absolute compression is generalized to the arbitrary sampling strategy, with rates that improve upon the previously known ones in this setting, and an analysis of EC-LSVRG with absolute compression for (strongly) convex problems is proposed.
SGD with low-dimensional gradients with applications to private and distributed learning
TLDR
This paper designs an optimization algorithm that operates with lower-dimensional (compressed) stochastic gradients, and establishes that with the right set of parameters it has the same dimension-free convergence guarantees as regular SGD for popular convex and nonconvex optimization settings.
...

References

Showing 1–10 of 21 references
signSGD: compressed optimisation for non-convex problems
TLDR
SignSGD can get the best of both worlds: compressed gradients and SGD-level convergence rate, and the momentum counterpart of signSGD is able to match the accuracy and convergence speed of Adam on deep ImageNet models.
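A minimal sketch of the sign-compression step with majority-vote aggregation described above; the learning rate and function name are illustrative:

```python
import numpy as np

def signsgd_step(x, worker_grads, lr=0.01):
    """One signSGD step with majority vote across workers (sketch).

    Each worker transmits only the sign of its stochastic gradient (1 bit
    per coordinate); the server applies the sign of the coordinate-wise sum.
    """
    votes = sum(np.sign(g) for g in worker_grads)
    return x - lr * np.sign(votes)
```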
An accelerated communication-efficient primal-dual optimization framework for structured machine learning
TLDR
An accelerated variant of CoCoA+ is proposed and shown to possess an improved convergence rate in terms of reducing suboptimality, and numerical experiments show that acceleration can lead to significant performance gains.
QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding
TLDR
Quantized SGD is proposed, a family of compression schemes for gradient updates which provides convergence guarantees and leads to significant reductions in end-to-end training time, and can be extended to stochastic variance-reduced techniques.
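A minimal sketch of the stochastic quantization idea described above (s quantization levels per coordinate, unbiased by construction); this is a generic re-implementation, not the paper's encoding scheme, and it omits the efficient integer coding:

```python
import numpy as np

def qsgd_quantize(v, s=4, rng=None):
    """QSGD-style stochastic quantization to s levels (sketch).

    Each coordinate is mapped to sign(v_i) * ||v||_2 * (l_i / s), where the
    integer level l_i is chosen randomly so the estimator is unbiased.
    """
    rng = rng or np.random.default_rng(0)
    norm = np.linalg.norm(v)
    if norm == 0:
        return np.zeros_like(v, dtype=float)
    scaled = np.abs(v) / norm * s          # in [0, s]
    lower = np.floor(scaled)
    prob_up = scaled - lower               # probability of the upper level
    levels = lower + (rng.random(v.shape) < prob_up)
    return np.sign(v) * norm * levels / s
```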
Adding vs. Averaging in Distributed Primal-Dual Optimization
TLDR
A novel generalization of the recent communication-efficient primal-dual framework (CoCoA) for distributed optimization, which allows for additive combination of local updates to the global parameters at each iteration, whereas previous schemes with convergence guarantees only allow conservative averaging.
TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning
TLDR
This work mathematically proves the convergence of TernGrad under the assumption of a bound on gradients, and proposes layer-wise ternarizing and gradient clipping to improve its convergence.
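A minimal sketch of the ternarization rule the summary describes; the layer-wise treatment and the gradient clipping mentioned above are omitted here for brevity:

```python
import numpy as np

def ternarize(g, rng=None):
    """TernGrad-style ternarization to {-s, 0, +s} (sketch).

    s is the max absolute gradient value; each coordinate keeps its sign
    with probability |g_i| / s and is zeroed otherwise, which is unbiased.
    """
    rng = rng or np.random.default_rng(0)
    s = np.max(np.abs(g))
    if s == 0:
        return np.zeros_like(g, dtype=float)
    keep = rng.random(g.shape) < np.abs(g) / s
    return s * np.sign(g) * keep
```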
CoCoA: A General Framework for Communication-Efficient Distributed Optimization
TLDR
This work presents a general-purpose framework for distributed computing environments, CoCoA, that has an efficient communication scheme and is applicable to a wide variety of problems in machine learning and signal processing, and extends the framework to cover general non-strongly-convex regularizers, including L1-regularized problems like lasso.
Federated Learning: Strategies for Improving Communication Efficiency
TLDR
Two ways to reduce the uplink communication costs are proposed: structured updates, where the user directly learns an update from a restricted space parametrized using a smaller number of variables, e.g. either low-rank or a random mask; and sketched updates, which learn a full model update and then compress it using a combination of quantization, random rotations, and subsampling.
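As a small illustration of the "random mask" flavour of structured/sketched updates, a numpy sketch in which only a random subset of entries is transmitted; the keep fraction and the unbiased rescaling are illustrative choices, not the paper's exact scheme:

```python
import numpy as np

def random_mask_update(update, keep_fraction=0.1, seed=0):
    """Sketched update via a shared random mask (illustrative).

    Only a random subset of entries is transmitted; if the mask is generated
    from a shared seed, the uplink only needs the kept values and the seed.
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(update.shape) < keep_fraction
    return update * mask / keep_fraction  # rescale so the update stays unbiased
```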
Communication-Efficient Distributed Dual Coordinate Ascent
TLDR
A communication-efficient framework that uses local computation in a primal-dual setting to dramatically reduce the amount of necessary communication is proposed, and a strong convergence rate analysis is provided for this class of algorithms.
DiSCO: Distributed Optimization for Self-Concordant Empirical Loss
TLDR
The algorithm is based on an inexact damped Newton method, where the inexact Newton steps are computed by a distributed preconditioned conjugate gradient method, and its iteration complexity and communication efficiency for minimizing self-concordant empirical loss functions are analyzed.
1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs
TLDR
This work shows empirically that in SGD training of deep neural networks, one can, at no or nearly no loss of accuracy, quantize the gradients aggressively—to but one bit per value—if the quantization error is carried forward across minibatches (error feedback), and implements data-parallel deterministically distributed SGD by combining this finding with AdaGrad.
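A minimal sketch of 1-bit quantization with error feedback as described above; the per-tensor scale is an illustrative choice rather than necessarily the paper's:

```python
import numpy as np

def one_bit_with_error_feedback(grad, residual, scale=None):
    """1-bit quantization with error feedback (sketch).

    The quantization error from the previous minibatch (residual) is added
    back before quantizing, so it is carried forward instead of being lost.
    """
    corrected = grad + residual
    if scale is None:
        scale = np.mean(np.abs(corrected))    # one shared magnitude per tensor
    quantized = scale * np.sign(corrected)    # 1 bit per value + one float
    residual = corrected - quantized          # error fed back next minibatch
    return quantized, residual
```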
...