Corpus ID: 52307874

Sparsified SGD with Memory

@inproceedings{Stich2018SparsifiedSW,
  title={Sparsified SGD with Memory},
  author={Sebastian U. Stich and Jean-Baptiste Cordonnier and Martin Jaggi},
  booktitle={NeurIPS},
  year={2018}
}
Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (top-k…
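The mechanism behind the title can be illustrated in a few lines: keep a memory (error-feedback) vector holding the gradient mass that was not transmitted, add it back before the next top-k selection, and apply only the selected coordinates. The single-worker NumPy sketch below is only an illustration of this idea; the toy quadratic objective, the fixed step size, and the function names are assumptions, not the paper's exact algorithm or hyperparameters.

    import numpy as np

    def top_k(v, k):
        # keep the k largest-magnitude entries of v, zero out the rest
        out = np.zeros_like(v)
        idx = np.argpartition(np.abs(v), -k)[-k:]
        out[idx] = v[idx]
        return out

    def sparsified_sgd_with_memory(grad_fn, x0, lr=0.1, k=10, steps=1000):
        # sketch of top-k sparsified SGD with an error-feedback memory
        x = x0.copy()
        memory = np.zeros_like(x0)        # accumulates coordinates that were not sent
        for _ in range(steps):
            g = grad_fn(x)                # stochastic gradient
            acc = memory + lr * g         # add back what was left out earlier
            update = top_k(acc, k)        # only k coordinates would be communicated
            memory = acc - update         # remember the residual for later steps
            x -= update                   # apply the sparse update
        return x

    # toy usage: minimize ||x||^2 / 2 with noisy gradients (illustrative only)
    rng = np.random.default_rng(0)
    grad_fn = lambda x: x + 0.01 * rng.standard_normal(x.shape)
    x_opt = sparsified_sgd_with_memory(grad_fn, rng.standard_normal(100), k=5)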
Understanding Top-k Sparsification in Distributed Deep Learning
TLDR: The property of the gradient distribution is exploited to propose an approximate top-k selection algorithm, which is computationally efficient on GPUs, to improve the scaling efficiency of TopK-SGD by significantly reducing the computation overhead.
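The cited paper observes that exact top-k selection itself can dominate the compression cost on GPUs. One way to approximate it without a full sort, assuming the gradient entries are roughly zero-mean Gaussian, is to estimate a magnitude threshold from the empirical standard deviation; the sketch below illustrates that thresholding idea only and is not the authors' exact selection algorithm.

    import numpy as np
    from scipy.stats import norm

    def approx_top_k(g, k):
        # for zero-mean Gaussian entries with std sigma, P(|g_i| > t) = 2*(1 - Phi(t/sigma)),
        # so t = sigma * Phi^{-1}(1 - k/(2d)) keeps roughly k coordinates
        d = g.size
        t = g.std() * norm.ppf(1.0 - k / (2.0 * d))
        out = np.zeros_like(g)
        mask = np.abs(g) > t
        out[mask] = g[mask]
        return out

    g = np.random.default_rng(1).standard_normal(1_000_000)
    print(np.count_nonzero(approx_top_k(g, k=1000)))   # roughly 1000 nonzeros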
Natural Compression for Distributed Deep Learning
TLDR: This work introduces a new, simple yet theoretically and practically effective compression technique: natural compression (NC). It is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive) power of two, which can be computed in a "natural" way by ignoring the mantissa.
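The randomized rounding behind natural compression is simple to state: each entry is rounded to one of its two neighbouring powers of two, with probabilities chosen so the result is unbiased in expectation. Below is a minimal NumPy sketch of that rounding rule; it ignores the paper's handling of floating-point corner cases and the exact encoding.

    import numpy as np

    def natural_compression(x, rng=None):
        # for |x| in [2^p, 2^(p+1)), round up with probability (|x| - 2^p) / 2^p,
        # otherwise round down, so that E[C(x)] = x
        rng = np.random.default_rng() if rng is None else rng
        out = np.zeros_like(x, dtype=float)
        nz = x != 0
        a = np.abs(x[nz])
        low = 2.0 ** np.floor(np.log2(a))      # nearest power of two below |x|
        up = rng.random(a.shape) < (a - low) / low
        out[nz] = np.sign(x[nz]) * low * np.where(up, 2.0, 1.0)
        return out

    x = np.array([0.3, -1.7, 5.0, 0.0])
    print(natural_compression(x))              # signed powers of two (or 0)
    print(np.mean([natural_compression(x) for _ in range(20000)], axis=0))  # close to x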
Distributed Sparse SGD with Majority Voting
TLDR: A novel majority-voting-based sparse communication strategy is introduced, in which the workers first seek a consensus on the structure of the sparse representation, which provides a significant reduction in the communication load and allows using the same sparsity level in both communication directions.
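Reading only the summary above, the voting step could look as follows: every worker nominates its local top-k coordinates, the coordinates nominated by at least half of the workers form the consensus support, and values are then exchanged only on that common support in both directions. This sketch is a guess at the structure implied by the TLDR, not the authors' exact protocol.

    import numpy as np

    def majority_vote_support(local_grads, k):
        # consensus sparsity pattern: coordinates appearing in the local top-k
        # of at least half of the workers (illustrative reading of the summary)
        votes = np.zeros(local_grads[0].size, dtype=int)
        for g in local_grads:
            idx = np.argpartition(np.abs(g), -k)[-k:]    # local top-k nominations
            votes[idx] += 1
        return np.flatnonzero(votes >= len(local_grads) / 2)

    rng = np.random.default_rng(2)
    grads = [rng.standard_normal(1000) for _ in range(8)]
    support = majority_vote_support(grads, k=50)
    # each worker now sends only grads[w][support]; the averaged update shares
    # the same support, so the same sparsity level is used in both directions
    avg_on_support = np.mean([g[support] for g in grads], axis=0)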
Sparse Communication for Training Deep Networks
TLDR: This work studies several compression schemes and identifies how three key parameters affect the performance of synchronous stochastic gradient descent, and introduces a simple sparsification scheme, the random-block sparsifier, that reduces communication while keeping the performance close to standard SGD.
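A random-block sparsifier in the spirit of the summary above keeps one randomly chosen block of coordinates and drops the rest; picking the block uniformly from a disjoint partition and rescaling by the number of blocks keeps the operator unbiased. The partition-and-rescale construction here is an assumption made for the sketch, not necessarily the paper's exact scheme.

    import numpy as np

    def random_block_sparsify(g, block_size, rng=None, rescale=True):
        # keep one block chosen uniformly from a disjoint partition of g
        rng = np.random.default_rng() if rng is None else rng
        num_blocks = g.size // block_size       # assumes block_size divides g.size
        b = rng.integers(num_blocks)
        out = np.zeros_like(g)
        out[b * block_size:(b + 1) * block_size] = g[b * block_size:(b + 1) * block_size]
        if rescale:                             # scaling by num_blocks gives E[out] = g
            out *= num_blocks
        return out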
Rethinking gradient sparsification as total error minimization
TLDR: This work identifies that the total error, i.e. the sum of the compression errors over all iterations, encapsulates the effect of sparsification throughout training, and proposes a communication complexity model that minimizes the total error under a communication budget for the entire training.
Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback
TLDR: A general distributed compressed SGD with Nesterov's momentum is proposed, which achieves the same testing accuracy as momentum SGD using full-precision gradients, but with 46% less wall-clock time.
Global Momentum Compression for Sparse Communication in Distributed SGD
TLDR: This is the first work to prove the convergence of distributed momentum SGD (DMSGD) with sparse communication and memory gradient; the convergence rate of GMC is proved theoretically for both convex and non-convex problems.
Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees
TLDR: A new distributed optimization method named LAGS-SGD is proposed, which combines S-SGD with a novel layer-wise adaptive gradient sparsification (LAGS) scheme; it has convergence guarantees with the same order of convergence rate as vanilla S-SGD under a weak analytical assumption.
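The layer-wise idea in LAGS-SGD can be contrasted with global selection in a few lines: instead of running top-k once over the concatenated gradient, the compressor is applied to each layer's gradient separately. The fixed per-layer density used here is an illustrative simplification of the paper's adaptive scheme.

    import numpy as np

    def top_k(v, k):
        # keep the k largest-magnitude entries of v, zero out the rest
        out = np.zeros_like(v)
        idx = np.argpartition(np.abs(v), -k)[-k:]
        out[idx] = v[idx]
        return out

    def layerwise_sparsify(layer_grads, density=0.01):
        # apply top-k to each layer's gradient separately (fixed density for illustration)
        return [top_k(g, max(1, int(density * g.size))) for g in layer_grads]

    grads = [np.random.default_rng(3).standard_normal(n) for n in (1000, 5000, 200)]
    sparse = layerwise_sparsify(grads)          # each layer keeps about 1% of its entries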
Local SGD Converges Fast and Communicates Little
TLDR: Concise convergence rates are proved for local SGD on convex problems, showing that it converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients; that is, the scheme achieves linear speedup in the number of workers and the mini-batch size.
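Local SGD itself is easy to state: each worker runs H local SGD steps on its own data, and the workers' parameters are averaged only once every H steps, so communication happens once per H gradient evaluations. The sketch below simulates the workers sequentially in one process; the toy objective and step size are illustrative assumptions.

    import numpy as np

    def local_sgd(grad_fn, x0, n_workers=4, local_steps=8, rounds=50, lr=0.1):
        # simulate local SGD: H local updates per worker, then parameter averaging
        x = x0.copy()
        for _ in range(rounds):
            replicas = []
            for w in range(n_workers):
                xw = x.copy()
                for _ in range(local_steps):    # H local steps, no communication
                    xw -= lr * grad_fn(xw, w)
                replicas.append(xw)
            x = np.mean(replicas, axis=0)       # one averaging (communication) round
        return x

    rng = np.random.default_rng(4)
    grad_fn = lambda x, w: x + 0.05 * rng.standard_normal(x.shape)  # noisy grad of ||x||^2/2
    x_final = local_sgd(grad_fn, rng.standard_normal(20))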
Compressing gradients by exploiting temporal correlation in momentum-SGD
An increasing bottleneck in decentralized optimization is communication. Bigger models and growing datasets mean that decentralization of computation is important and that the amount of information…

References

Showing 1-10 of 50 references
Convex Optimization using Sparsified Stochastic Gradient Descent with Memory
TLDR: A sparsification scheme for SGD where only a small constant number of coordinates are applied at each iteration, which outperforms QSGD in progress per number of bits sent and opens the path to using lock-free asynchronous parallelization on dense problems.
The Convergence of Sparsified Gradient Methods
TLDR: It is proved that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees, for both convex and non-convex smooth objectives, for data-parallel SGD.
Local SGD Converges Fast and Communicates Little
TLDR: Concise convergence rates are proved for local SGD on convex problems, showing that it converges at the same rate as mini-batch SGD in terms of the number of evaluated gradients; that is, the scheme achieves linear speedup in the number of workers and the mini-batch size.
QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding
TLDR: Quantized SGD is proposed, a family of compression schemes for gradient updates which provides convergence guarantees, leads to significant reductions in end-to-end training time, and can be extended to stochastic variance-reduced techniques.
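The QSGD quantizer can be sketched directly: normalize each entry by the vector's Euclidean norm, stochastically round it to one of s uniform levels so the result stays unbiased, and transmit the norm, the signs, and the small integer levels. The sketch below covers the quantization step only; the lossless encoding of the integers is omitted.

    import numpy as np

    def qsgd_quantize(g, s=16, rng=None):
        # unbiased stochastic quantization of g onto s levels
        rng = np.random.default_rng() if rng is None else rng
        norm = np.linalg.norm(g)
        if norm == 0:
            return np.zeros_like(g)
        scaled = np.abs(g) / norm * s                               # position in [0, s]
        lower = np.floor(scaled)
        level = lower + (rng.random(g.shape) < (scaled - lower))    # round up w.p. fractional part
        return np.sign(g) * norm * level / s                        # E[result] = g

    g = np.random.default_rng(5).standard_normal(1000)
    approx = np.mean([qsgd_quantize(g) for _ in range(5000)], axis=0)
    print(approx[:5], g[:5])                                        # averages are close to g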
Scalable distributed DNN training using commodity GPU cloud computing
TLDR: It is shown empirically that the method can reduce the amount of communication by three orders of magnitude while training a typical DNN for acoustic modelling, and that it enables efficient scaling to more parallel GPU nodes than any other method the authors are aware of.
AdaComp : Adaptive Residual Gradient Compression for Data-Parallel Distributed Training
TLDR: This paper introduces a novel technique, the Adaptive Residual Gradient Compression (AdaComp) scheme, which is based on localized selection of gradient residues and automatically tunes the compression rate depending on local activity.
Scaling SGD Batch Size to 32K for ImageNet Training
TLDR: Layer-wise Adaptive Rate Scaling (LARS) is proposed, a method to enable large-batch training for general networks and datasets; it can scale the batch size to 32768 for ResNet50 and 8192 for AlexNet.
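The LARS rule gives each layer its own learning-rate multiplier proportional to the ratio of its weight norm to its gradient norm. The per-layer update below follows the commonly quoted formulation; the trust coefficient, weight decay, and their exact placement are assumptions that may differ in detail from the paper.

    import numpy as np

    def lars_update(w, g, base_lr=0.1, trust=0.001, weight_decay=5e-4):
        # one LARS step for a single layer: local LR ~ ||w|| / (||g|| + wd * ||w||)
        local_lr = trust * np.linalg.norm(w) / (
            np.linalg.norm(g) + weight_decay * np.linalg.norm(w) + 1e-12)
        return w - base_lr * local_lr * (g + weight_decay * w)

    rng = np.random.default_rng(6)
    w, g = rng.standard_normal(1000), rng.standard_normal(1000)
    w_new = lars_update(w, g)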
meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting
TLDR: Surprisingly, experimental results demonstrate that only 1-4% of the weights need to be updated at each back-propagation pass, and that the accuracy of the resulting models is actually improved rather than degraded; a detailed analysis is given.
Error Compensated Quantized SGD and its Applications to Large-scale Distributed Optimization
TLDR: This paper proposes the error-compensated quantized stochastic gradient descent algorithm to improve training efficiency, presents a theoretical analysis of its convergence behaviour, and demonstrates its advantage over competitors.
Gradient Sparsification for Communication-Efficient Distributed Optimization
TLDR: This paper proposes a convex optimization formulation to minimize the coding length of stochastic gradients; experiments on regularized logistic regression, support vector machines, and convolutional neural networks validate the proposed approaches.
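The sparsification in this reference keeps coordinate i with some probability p_i and rescales the surviving value by 1/p_i, so the sparsified gradient remains unbiased; the paper picks the p_i by solving a small convex problem that trades variance against expected sparsity. The sketch below uses the simpler magnitude-proportional choice p_i proportional to |g_i| (capped at 1) as an illustration, not the paper's optimal solution.

    import numpy as np

    def unbiased_random_sparsify(g, target_nnz, rng=None):
        # keep coordinate i with probability p_i ~ |g_i| (capped at 1), rescale by 1/p_i
        rng = np.random.default_rng() if rng is None else rng
        p = np.minimum(target_nnz * np.abs(g) / np.sum(np.abs(g)), 1.0)
        keep = rng.random(g.shape) < p
        out = np.zeros_like(g)
        out[keep] = g[keep] / p[keep]           # E[out] = g, expected nnz <= target_nnz
        return out

    g = np.random.default_rng(8).standard_normal(10000)
    print(np.count_nonzero(unbiased_random_sparsify(g, target_nnz=100)))   # about 100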