# Sparsified SGD with Memory

```bibtex
@inproceedings{Stich2018SparsifiedSW,
  title     = {Sparsified SGD with Memory},
  author    = {Sebastian U. Stich and Jean-Baptiste Cordonnier and Martin Jaggi},
  booktitle = {NeurIPS},
  year      = {2018}
}
```

Huge-scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e., algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (top-k…
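
The core mechanism the paper studies, top-k sparsification with an error-feedback memory, can be sketched as follows. This is a minimal illustration, not the authors' code; all names are illustrative.

```python
import numpy as np

def topk_with_memory(grad, memory, k):
    """One step of top-k sparsification with error feedback.

    The worker adds the residual kept from previous rounds to the fresh
    stochastic gradient, transmits only the k largest-magnitude entries,
    and stores the untransmitted remainder as the new residual.
    """
    corrected = grad + memory                          # error-compensated gradient
    idx = np.argpartition(np.abs(corrected), -k)[-k:]  # indices of the k largest magnitudes
    sparse = np.zeros_like(corrected)
    sparse[idx] = corrected[idx]                       # the message actually communicated
    new_memory = corrected - sparse                    # residual kept locally
    return sparse, new_memory

g = np.array([0.1, -2.0, 0.5, 3.0, -0.2])
m = np.zeros_like(g)
s, m = topk_with_memory(g, m, k=2)                     # s keeps only the two largest entries
```

Note that the transmitted message plus the new memory always reconstructs the compensated gradient exactly, which is the property the convergence analysis relies on.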

#### 275 Citations

Understanding Top-k Sparsification in Distributed Deep Learning

- Computer Science, Mathematics
- ArXiv
- 2019

The property of the gradient distribution is exploited to propose an approximate top-$k$ selection algorithm, which is computationally efficient on GPUs, to improve the scaling efficiency of TopK-SGD by significantly reducing the computation overhead.
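
One way such distribution-based selection can work is to pick a magnitude threshold from a Gaussian model of the gradient entries instead of sorting. The Gaussian-quantile heuristic below is an assumption for illustration, not the paper's exact estimator.

```python
import math
import numpy as np

def gaussian_threshold(sigma, frac):
    """Magnitude threshold t with P(|N(0, sigma^2)| > t) = frac,
    found by bisection on the error function (no SciPy needed)."""
    lo, hi = 0.0, 10.0 * sigma
    for _ in range(60):
        mid = (lo + hi) / 2
        tail = 1.0 - math.erf(mid / (sigma * math.sqrt(2)))
        if tail > frac:
            lo = mid        # too many entries pass; raise the threshold
        else:
            hi = mid
    return (lo + hi) / 2

def approx_topk(grad, k):
    """Keep roughly the k largest-magnitude entries by thresholding,
    assuming entries are approximately zero-mean Gaussian."""
    t = gaussian_threshold(grad.std(), k / grad.size)
    out = np.zeros_like(grad)
    mask = np.abs(grad) > t
    out[mask] = grad[mask]
    return out
```

Thresholding costs a single pass over the vector, whereas exact top-$k$ requires a (partial) sort, which is the source of the GPU overhead the abstract refers to.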

Natural Compression for Distributed Deep Learning

- Computer Science, Mathematics
- ArXiv
- 2019

This work introduces a new, simple, yet theoretically and practically effective compression technique: *natural compression* (NC), which is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive) power of two, computable in a "natural" way by ignoring the mantissa.
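
The randomized rounding described above can be sketched as follows; this is an unbiased power-of-two rounding consistent with the abstract, not the authors' implementation (which operates directly on the floating-point exponent).

```python
import numpy as np

def natural_compression(x, rng):
    """Round each entry stochastically to one of the two nearest signed
    powers of two, with probabilities chosen so the result is unbiased
    in expectation. Zero entries pass through unchanged."""
    sign = np.sign(x)
    mag = np.abs(x)
    out = np.zeros_like(x)
    nz = mag > 0
    a = np.floor(np.log2(mag[nz]))       # exponent of the lower power of two
    low, high = 2.0 ** a, 2.0 ** (a + 1)
    p_up = (mag[nz] - low) / low         # probability of rounding up
    up = rng.random(p_up.shape) < p_up
    out[nz] = sign[nz] * np.where(up, high, low)
    return out

rng = np.random.default_rng(0)
q = natural_compression(np.array([3.0, -6.0, 0.0]), rng)  # each entry becomes a signed power of two (or zero)
```

Unbiasedness follows from E[output] = low·(1-p) + high·p = |x| for each nonzero entry, so the compressed gradient is still a valid stochastic gradient.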

Distributed Sparse SGD with Majority Voting

- Computer Science, Mathematics
- ArXiv
- 2020

A novel majority-voting-based sparse communication strategy is introduced, in which the workers first seek a consensus on the structure of the sparse representation; this provides a significant reduction in the communication load and allows using the same sparsity level in both communication directions.
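
A plausible reading of the consensus step: each worker nominates its own top-k coordinate indices, and only coordinates nominated by a majority form the shared mask. This is a hypothetical sketch under that assumption, not the paper's exact protocol.

```python
import numpy as np

def majority_vote_mask(grads, k):
    """Each row of `grads` is one worker's gradient. Every worker votes
    for its k largest-magnitude coordinates; coordinates backed by more
    than half the workers form the shared sparsity mask."""
    n_workers, dim = grads.shape
    votes = np.zeros(dim, dtype=int)
    for g in grads:
        idx = np.argpartition(np.abs(g), -k)[-k:]
        votes[idx] += 1
    return votes > n_workers / 2

grads = np.array([[5.0, 0.1, 0.2],
                  [4.0, 0.3, 0.1],
                  [0.1, 6.0, 0.2]])
mask = majority_vote_mask(grads, k=1)   # two of three workers agree on coordinate 0
```

Because every worker then transmits the same coordinates, the server's aggregate is sparse with the same pattern, which is what enables the identical sparsity level in both directions.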

Sparse Communication for Training Deep Networks

- Computer Science, Mathematics
- ArXiv
- 2020

This work studies several compression schemes and identifies how three key parameters affect the performance of synchronous stochastic gradient descent, and introduces a simple sparsification scheme, random-block sparsification, that reduces communication while keeping the performance close to standard SGD.

Rethinking gradient sparsification as total error minimization

- Computer Science, Mathematics
- ArXiv
- 2021

This work identifies that the total error (the sum of the compression errors over all iterations) captures the effect of sparsification throughout training, and proposes a communication complexity model that minimizes the total error under a communication budget for the entire training.

Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback

- Computer Science, Mathematics
- NeurIPS
- 2019

A general distributed compressed SGD with Nesterov's momentum is proposed, which achieves the same test accuracy as momentum SGD using full-precision gradients, but with 46% less wall-clock time.

Global Momentum Compression for Sparse Communication in Distributed SGD

- Computer Science, Mathematics
- ArXiv
- 2019

This is the first work that proves the convergence of distributed momentum SGD (DMSGD) with sparse communication and memory gradient; it theoretically proves the convergence rate of GMC for both convex and non-convex problems.

Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees

- Computer Science, Mathematics
- ECAI
- 2020

A new distributed optimization method named LAGS-SGD is proposed, which combines S-SGD with a novel layer-wise adaptive gradient sparsification (LAGS) scheme; it has convergence guarantees and the same order of convergence rate as vanilla S-SGD under a weak analytical assumption.

Local SGD Converges Fast and Communicates Little

- Computer Science, Mathematics
- ICLR
- 2019

Concise convergence rates are proved for local SGD on convex problems, showing that it converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients; that is, the scheme achieves linear speedup in the number of workers and the mini-batch size.
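
The local SGD scheme analyzed there can be sketched in a few lines: each worker takes several SGD steps on its own parameter copy, and copies are averaged only once per communication round. A minimal simulation, with all names illustrative:

```python
import numpy as np

def local_sgd(w0, grad_fn, n_workers, rounds, local_steps, lr, rng):
    """Simulate local SGD: workers run `local_steps` SGD updates on
    private copies of the parameters, then average once per round.
    `grad_fn(w, rng)` is a stochastic gradient oracle."""
    w = np.copy(w0)
    for _ in range(rounds):
        copies = [np.copy(w) for _ in range(n_workers)]
        for _ in range(local_steps):
            for i in range(n_workers):
                copies[i] -= lr * grad_fn(copies[i], rng)
        w = np.mean(copies, axis=0)      # the only communication in the round
    return w

# toy strongly convex objective f(w) = ||w||^2 / 2 with noisy gradients
grad_fn = lambda w, rng: w + 0.01 * rng.standard_normal(w.shape)
w = local_sgd(np.array([10.0, 10.0]), grad_fn,
              n_workers=4, rounds=50, local_steps=5,
              lr=0.1, rng=np.random.default_rng(0))
```

Communication drops by a factor of `local_steps` relative to mini-batch SGD, which is the trade-off the convergence rates quantify.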

Compressing gradients by exploiting temporal correlation in momentum-SGD

- Computer Science
- ArXiv
- 2021

An increasing bottleneck in decentralized optimization is communication. Bigger models and growing datasets mean that decentralization of computation is important and that the amount of information…

#### References

Showing 1-10 of 50 references

Convex Optimization using Sparsified Stochastic Gradient Descent with Memory

- Computer Science
- 2018

A sparsification scheme for SGD where only a small constant number of coordinates are applied at each iteration, which outperforms QSGD in progress per number of bits sent and opens the path to using lock-free asynchronous parallelization on dense problems.

The Convergence of Sparsified Gradient Methods

- Computer Science, Mathematics
- NeurIPS
- 2018

It is proved that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees, for both convex and non-convex smooth objectives, for data-parallel SGD.

Local SGD Converges Fast and Communicates Little

- Computer Science, Mathematics
- ICLR
- 2019

Concise convergence rates are proved for local SGD on convex problems, showing that it converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients; that is, the scheme achieves linear speedup in the number of workers and the mini-batch size.

QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding

- Computer Science
- NIPS
- 2017

Quantized SGD is proposed, a family of compression schemes for gradient updates which provides convergence guarantees and leads to significant reductions in end-to-end training time, and can be extended to stochastic variance-reduced techniques.
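
QSGD's quantizer encodes each entry as sign times norm times a stochastically rounded level out of `s`. A minimal sketch of that quantization step (the accompanying Elias coding of levels is omitted):

```python
import numpy as np

def qsgd_quantize(v, s, rng):
    """QSGD-style stochastic quantization with s levels per entry.
    Each entry becomes sign(v_i) * ||v|| * (level / s), with the level
    rounded up or down at random so the quantizer is unbiased."""
    norm = np.linalg.norm(v)
    if norm == 0:
        return np.zeros_like(v)
    ratio = np.abs(v) / norm              # in [0, 1]
    lower = np.floor(ratio * s)           # lower quantization level
    p_up = ratio * s - lower              # probability of rounding up
    level = lower + (rng.random(v.shape) < p_up)
    return np.sign(v) * norm * level / s

rng = np.random.default_rng(0)
q = qsgd_quantize(np.array([3.0, 4.0]), s=4, rng=rng)  # entries are multiples of ||v||/4
```

Only the scalar norm, the signs, and the small integer levels need to be transmitted, which is where the communication savings come from.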

Scalable distributed DNN training using commodity GPU cloud computing

- Computer Science
- INTERSPEECH
- 2015

It is shown empirically that the method can reduce the amount of communication by three orders of magnitude while training a typical DNN for acoustic modelling, and enables efficient scaling to more parallel GPU nodes than any other method the authors are aware of.

AdaComp: Adaptive Residual Gradient Compression for Data-Parallel Distributed Training

- Computer Science, Mathematics
- AAAI
- 2018

This paper introduces a novel technique, the Adaptive Residual Gradient Compression (AdaComp) scheme, which is based on localized selection of gradient residues and automatically tunes the compression rate depending on local activity.

Scaling SGD Batch Size to 32K for ImageNet Training

- Computer Science
- ArXiv
- 2017

Layer-wise Adaptive Rate Scaling (LARS) is proposed, a method to enable large-batch training for general networks or datasets; it can scale the batch size to 32768 for ResNet-50 and 8192 for AlexNet.

meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting

- Computer Science, Mathematics
- ICML
- 2017

Surprisingly, experimental results demonstrate that only 1-4% of the weights need to be updated at each back-propagation pass, and the accuracy of the resulting models is actually improved rather than degraded; a detailed analysis is given.

Error Compensated Quantized SGD and its Applications to Large-scale Distributed Optimization

- Computer Science
- ICML
- 2018

This paper proposes the error compensated quantized stochastic gradient descent algorithm to improve the training efficiency, and presents theoretical analysis on the convergence behaviour, and demonstrates its advantage over competitors.

Gradient Sparsification for Communication-Efficient Distributed Optimization

- Computer Science, Mathematics
- NeurIPS
- 2018

This paper proposes a convex optimization formulation to minimize the coding length of stochastic gradients; experiments on regularized logistic regression, support vector machines, and convolutional neural networks validate the proposed approaches.