Corpus ID: 119589180

Stochastic Distributed Learning with Gradient Quantization and Variance Reduction

@article{Horvath2019StochasticDL,
  title={Stochastic Distributed Learning with Gradient Quantization and Variance Reduction},
  author={Samuel Horv{\'a}th and D. Kovalev and Konstantin Mishchenko and Sebastian U. Stich and Peter Richt{\'a}rik},
  journal={arXiv: Optimization and Control},
  year={2019}
}
We consider distributed optimization where the objective function is spread among different devices, each sending incremental model updates to a central server. To alleviate the communication bottleneck, recent work proposed various schemes to compress (e.g. quantize or sparsify) the gradients, thereby introducing additional variance $\omega \geq 1$ that might slow down convergence. For strongly convex functions with condition number $\kappa$ distributed among $n$ machines, we (i) give a… 
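As a concrete illustration of such a compression operator (our own sketch, not taken from the paper): under the common convention $\mathbb{E}[Q(x)] = x$ and $\mathbb{E}\|Q(x)\|^2 \le \omega\|x\|^2$, rescaled random-$k$ sparsification is a valid choice with $\omega = d/k$.

```python
import numpy as np

def random_k_sparsify(x: np.ndarray, k: int, rng: np.random.Generator) -> np.ndarray:
    """Rescaled random-k sparsification: keep k uniformly chosen coordinates
    and scale them by d/k, so that E[Q(x)] = x. Under the convention
    E||Q(x)||^2 <= omega * ||x||^2 this operator has omega = d/k."""
    d = x.size
    out = np.zeros_like(x)
    idx = rng.choice(d, size=k, replace=False)
    out[idx] = (d / k) * x[idx]
    return out

# quick empirical sanity check of unbiasedness and of omega = d/k
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)
samples = np.stack([random_k_sparsify(x, k=100, rng=rng) for _ in range(2000)])
print(np.abs(samples.mean(axis=0) - x).max())              # small -> unbiased
print((samples ** 2).sum(axis=1).mean() / (x ** 2).sum())  # roughly d/k = 10
```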

Citations

On Biased Compression for Distributed Learning
TLDR
It is shown for the first time that biased compressors can lead to linear convergence rates both in the single-node and distributed settings, and a new high-performing biased compressor is proposed, a combination of Top-k and natural dithering, which in the authors' experiments outperforms all other compression techniques.
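For reference, a minimal NumPy sketch of the (biased) Top-$k$ half of the compressor mentioned above; the natural-dithering part and the paper's analysis are not reproduced here.

```python
import numpy as np

def top_k(x: np.ndarray, k: int) -> np.ndarray:
    """Biased Top-k compressor: keep the k largest-magnitude coordinates,
    zero out the rest. It is contractive rather than unbiased:
    ||top_k(x) - x||^2 <= (1 - k/d) * ||x||^2."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]  # indices of the k largest |x_i|
    out[idx] = x[idx]
    return out
```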
Acceleration for Compressed Gradient Descent in Distributed and Federated Optimization
TLDR
This paper proposes the first accelerated compressed gradient descent (ACGD) methods, improving upon the existing non-accelerated rates and recovering the optimal rates of accelerated gradient descent as a special case when no compression is applied.
Decentralized Deep Learning with Arbitrary Communication Compression
TLDR
The proposed use of communication compression in the decentralized training context achieves linear speedup in the number of workers and supports higher compression than previous state-of-the-art methods.
Natural Compression for Distributed Deep Learning
TLDR
This work introduces a new, simple, yet theoretically and practically effective compression technique: natural compression (NC), which is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive) power of two; this rounding can be computed in a "natural" way by ignoring the mantissa.
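A minimal sketch of the randomized rounding rule described above (our own NumPy illustration, not the authors' code; the bit-level mantissa trick is ignored and only the unbiased rounding to powers of two is shown).

```python
import numpy as np

def natural_compression(x: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Unbiased randomized rounding of each entry to one of the two nearest
    powers of two (sign preserved, zeros left untouched), so E[C(x)] = x."""
    out = np.zeros_like(x, dtype=float)
    nz = x != 0
    a = np.exp2(np.floor(np.log2(np.abs(x[nz]))))  # lower neighboring power of two
    p = (np.abs(x[nz]) - a) / a                    # prob. of rounding up to 2a
    up = rng.random(a.shape) < p
    out[nz] = np.sign(x[nz]) * np.where(up, 2 * a, a)
    return out
```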
Differentially Quantized Gradient Methods
TLDR
The principle of differential quantization is introduced, which prescribes compensating the past quantization errors to direct the descent trajectory of a quantized algorithm towards that of its unquantized counterpart; the differentially quantized heavy-ball method is shown to attain the optimal contraction achievable among all (even unquantized) gradient methods.
A Better Alternative to Error Feedback for Communication-Efficient Distributed Learning
TLDR
This paper proposes a construction that can transform any contractive compressor into an induced unbiased compressor, and shows that this approach leads to vast improvements over error feedback (EF), including reduced memory requirements, better communication complexity guarantees, and fewer assumptions.
Lower Bounds and Nearly Optimal Algorithms in Distributed Learning with Communication Compression
TLDR
A convergence lower bound is established for algorithms using either unbiased or contractive compressors, in both unidirectional and bidirectional compression settings, and an algorithm, NEOLITHIC, is proposed that nearly matches the lower bound (up to logarithmic factors) under mild conditions.
SGD with low-dimensional gradients with applications to private and distributed learning
TLDR
This paper designs an optimization algorithm that operates with lower-dimensional (compressed) stochastic gradients and establishes that, with the right set of parameters, it has the exact same dimension-free convergence guarantees as regular SGD for popular convex and nonconvex optimization settings.
A Double Residual Compression Algorithm for Efficient Distributed Learning
TLDR
The theoretical analyses demonstrate that the proposed strategy has superior convergence properties for both strongly convex and nonconvex objective functions, and the experimental results validate that DORE achieves the best communication efficiency while maintaining similar model accuracy and convergence speed in comparison with state-of-the-art baselines.
vqSGD: Vector Quantized Stochastic Gradient Descent
TLDR
This work presents a family of vector quantization schemes that provide an asymptotic reduction in the communication cost with convergence guarantees in first-order distributed optimization and shows that vqSGD also offers automatic privacy guarantees.

References

SHOWING 1-10 OF 61 REFERENCES
Decentralized Stochastic Optimization and Gossip Algorithms with Compressed Communication
TLDR
This work presents a novel gossip-based stochastic gradient descent algorithm, CHOCO-SGD, that converges at rate $\mathcal{O}\left(1/(nT) + 1/(T \delta^2 \omega)^2\right)$ for strongly convex objectives, where $T$ denotes the number of iterations and $\delta$ the eigengap of the connectivity matrix.
Communication Compression for Decentralized Training
TLDR
This paper develops a framework of quantized, decentralized training and proposes two different strategies, called extrapolation compression and difference compression, which significantly outperform the best of the merely decentralized and merely quantized algorithms for networks with high latency and low bandwidth.
Distributed Mean Estimation with Limited Communication
TLDR
This work shows that applying a structured random rotation before quantization, together with a better coding strategy, further reduces the error to O(1/n), and that the latter coding strategy is optimal up to a constant in the minimax sense, i.e., it achieves the best MSE for a given communication cost.
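A rough sketch of the "structured random rotation before quantization" idea, under our own simplifying assumptions (dimension a power of two, naive 1-bit stochastic quantization, and without the improved coding strategy from the paper).

```python
import numpy as np

def hadamard(x: np.ndarray) -> np.ndarray:
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = x.astype(float).copy()
    n, h = y.size, 1
    while h < n:
        for i in range(0, n, 2 * h):
            a, b = y[i:i + h].copy(), y[i + h:i + 2 * h].copy()
            y[i:i + h], y[i + h:i + 2 * h] = a + b, a - b
        h *= 2
    return y / np.sqrt(n)

def rotate_then_quantize(x, rng):
    """Client side: rotate with H*D (D = random signs), then 1-bit stochastic
    quantization between the min and max of the rotated vector."""
    d = rng.choice([-1.0, 1.0], size=x.size)  # in practice derived from a shared seed
    z = hadamard(d * x)                       # rotation spreads the energy evenly
    lo, hi = z.min(), z.max()
    p = (z - lo) / (hi - lo + 1e-12)          # prob. of transmitting the upper level
    bits = rng.random(z.size) < p             # one bit per coordinate
    return bits, lo, hi, d

def dequantize_then_unrotate(bits, lo, hi, d):
    """Server side: undo the quantization, then the rotation (H is symmetric
    and orthonormal, D is its own inverse), giving an unbiased estimate of x."""
    z_hat = np.where(bits, hi, lo)
    return d * hadamard(z_hat)
```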
Adding vs. Averaging in Distributed Primal-Dual Optimization
TLDR
A novel generalization of the recent communication-efficient primal-dual framework (COCOA) for distributed optimization is proposed, which allows for additive combination of local updates to the global parameters at each iteration, whereas previous schemes with convergence guarantees only allow conservative averaging.
QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding
TLDR
Quantized SGD (QSGD) is proposed, a family of compression schemes for gradient updates that provides convergence guarantees, leads to significant reductions in end-to-end training time, and can be extended to stochastic variance-reduced techniques.
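A minimal NumPy sketch of the core QSGD quantization rule with $s$ levels, as we understand it (the subsequent lossless encoding of the quantized values is omitted).

```python
import numpy as np

def qsgd_quantize(v: np.ndarray, s: int, rng: np.random.Generator) -> np.ndarray:
    """Unbiased QSGD-style quantization: each |v_i| / ||v|| is stochastically
    rounded to one of the levels {0, 1/s, ..., 1}, then rescaled by ||v|| and
    the sign, so that E[Q(v)] = v."""
    norm = np.linalg.norm(v)
    if norm == 0:
        return np.zeros_like(v)
    scaled = np.abs(v) / norm * s              # lies in [0, s]
    lower = np.floor(scaled)                   # lower quantization level
    p = scaled - lower                         # prob. of rounding up one level
    level = lower + (rng.random(v.shape) < p)  # stochastic rounding
    return norm * np.sign(v) * level / s
```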
Distributed Learning with Compressed Gradient Differences
TLDR
This work proposes a new distributed learning method, DIANA, which resolves these issues via compression of gradient differences; a theoretical analysis in the strongly convex and nonconvex settings shows that its rates are superior to existing rates.
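To make "compression of gradient differences" concrete, here is a rough single-process simulation of one round in the spirit of DIANA, based on our reading of the summary above; the quantizer, step sizes $\gamma, \alpha$, and problem are placeholder choices.

```python
import numpy as np

def diana_style_round(x, h_locals, grads, quantize, gamma, alpha):
    """One round: each worker compresses the difference between its gradient
    and a local reference h_i; the server averages the compressed differences
    to form a gradient estimate; both sides then shift the references.
    x: model (d,), h_locals: (n, d), grads: (n, d), quantize: unbiased compressor."""
    deltas = np.stack([quantize(g - h) for g, h in zip(grads, h_locals)])
    g_hat = h_locals.mean(axis=0) + deltas.mean(axis=0)  # gradient estimate
    x_new = x - gamma * g_hat                            # server model step
    h_new = h_locals + alpha * deltas                    # reference update
    return x_new, h_new
```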
Local SGD Converges Fast and Communicates Little
TLDR
Concise convergence rates are proved for local SGD on convex problems, showing that it converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients; that is, the scheme achieves linear speedup in the number of workers and the mini-batch size.
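A schematic sketch of the local SGD pattern being analyzed: every worker runs several SGD steps locally and only the averaged model is communicated (the gradient oracle and step size below are placeholders).

```python
import numpy as np

def local_sgd_round(x, grad_fn, n_workers, local_steps, lr, rng):
    """One communication round of local SGD: each worker starts from the shared
    iterate x, performs `local_steps` SGD steps on its own data via the
    placeholder oracle grad_fn(w, worker_id, rng), then the models are averaged."""
    workers = np.tile(x, (n_workers, 1))
    for i in range(n_workers):
        for _ in range(local_steps):
            workers[i] -= lr * grad_fn(workers[i], i, rng)
    return workers.mean(axis=0)  # the only communication step in the round
```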
The Convergence of Sparsified Gradient Methods
TLDR
It is proved that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees, for both convex and non-convex smooth objectives, for data-parallel SGD.
Sparsified SGD with Memory
TLDR
This work analyzes Stochastic Gradient Descent with k-sparsification or compression (for instance top-k or random-k) and shows that this scheme converges at the same rate as vanilla SGD when equipped with error compensation.
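A minimal sketch of the memory (error-compensation) mechanism described in the last two entries, with the compressor, learning rate, and gradient as placeholders.

```python
import numpy as np

def ef_sgd_step(x, memory, grad, compress, lr):
    """One step of sparsified SGD with memory / error feedback: add the fresh
    scaled gradient to the residual memory, transmit only the compressed part,
    and keep the untransmitted remainder in memory for later steps."""
    update = memory + lr * grad  # accumulated residual plus new gradient
    sent = compress(update)      # e.g. top-k or random-k sparsification
    memory = update - sent       # the error stays local ("memory")
    x = x - sent                 # apply only the transmitted part
    return x, memory
```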
Randomized Distributed Mean Estimation: Accuracy vs. Communication
TLDR
A flexible family of randomized algorithms exploring the trade-off between expected communication cost and estimation error is proposed, which contains the full-communication, zero-error method on one extreme, and an epsilon-bit communication, O(1/(epsilon n)) error method on the opposite extreme.