• Corpus ID: 52307874

Sparsified SGD with Memory

Sebastian U. Stich, Jean-Baptiste Cordonnier, Martin Jaggi
Huge-scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e., algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works have proposed using quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (top-k… 
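The idea named in the title and abstract can be sketched as follows. This is a minimal single-worker illustration, not the authors' implementation: the abstract is truncated before it describes the memory step, so the error-accumulation rule below (add the unsent remainder back into the next update) is an assumption based on the title, and the names `top_k`, `sparsified_sgd_with_memory`, `grad_fn`, `lr`, and `steps` are hypothetical.

```python
def top_k(vec, k):
    """Keep the k largest-magnitude entries of vec and zero out the rest."""
    keep = sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k]
    out = [0.0] * len(vec)
    for i in keep:
        out[i] = vec[i]
    return out


def sparsified_sgd_with_memory(grad_fn, x0, lr, k, steps):
    """Run SGD where only a top-k sparsified update is applied each step.

    Assumption: entries that are not sent are kept in `memory` and added
    to the next step's scaled gradient before sparsifying again.
    """
    x = list(x0)
    memory = [0.0] * len(x)
    for _ in range(steps):
        g = grad_fn(x)
        # Accumulate the scaled gradient on top of what was held back so far.
        acc = [m + lr * gi for m, gi in zip(memory, g)]
        # Only these k entries would be communicated.
        update = top_k(acc, k)
        x = [xi - ui for xi, ui in zip(x, update)]
        # Everything that was not sent stays in memory.
        memory = [ai - ui for ai, ui in zip(acc, update)]
    return x, memory


if __name__ == "__main__":
    target = [1.0, -2.0, 3.0, 0.5]
    # Gradient of 0.5 * ||x - target||^2, used here as a toy objective.
    grad = lambda x: [xi - ti for xi, ti in zip(x, target)]
    print(sparsified_sgd_with_memory(grad, [0.0] * 4, lr=0.1, k=1, steps=200))
```

With `k` equal to the dimension, nothing is held back and the loop reduces to plain gradient descent, which makes a convenient sanity check.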

Detached Error Feedback for Distributed SGD with Random Sparsification

This work proposes a new detached error feedback (DEF) algorithm, which shows a better convergence bound than error feedback for non-convex problems, and proposes DEF-A to accelerate the generalization of DEF at the early stages of training, which shows better generalization bounds than DEF.

AC-SGD: Adaptively Compressed SGD for Communication-Efficient Distributed Learning

This paper proposes a novel Adaptively-Compressed Stochastic Gradient Descent (AC-SGD) strategy to adjust the number of quantization bits and the sparsification size with respect to the norm of gradients, the communication budget, and the remaining number of iterations, and derives an upper bound of the convergence error for an arbitrary dynamic compression strategy.

Understanding Top-k Sparsification in Distributed Deep Learning

The property of gradient distribution is exploited to propose an approximate top-$k$ selection algorithm, which is computing-efficient for GPUs, to improve the scaling efficiency of TopK-SGD by significantly reducing the computing overhead.

LR-SGD: Layer-based Random SGD For Distributed Deep Learning

An efficient sparsification method, layer-based random SGD (LR-SGD), is proposed that randomly selects a certain number of layers of the DNN model to be exchanged instead of some elements of each tensor, which reduces communication while keeping the performance close to SSGD.

Natural Compression for Distributed Deep Learning

This work introduces a new, simple yet theoretically and practically effective compression technique: natural compression (NC), which is applied individually to all entries of the to-be-compressed update vector and works by randomized rounding to the nearest (negative or positive) power of two, which can be computed in a "natural" way by ignoring the mantissa.
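A rough sketch of randomized rounding to a power of two, applied to a single entry, is given below. The summary does not state the rounding probabilities, so this sketch assumes they are chosen to make the result unbiased in expectation. The function name `natural_compression` and the `rng` parameter are hypothetical, and special values (infinities, NaN) are not handled.

```python
import math
import random


def natural_compression(x, rng=random):
    """Randomly round x to one of the two neighbouring powers of two, keeping its sign.

    Assumption: the upper neighbour is chosen with probability
    (|x| - low) / low, so the expected value of the output equals x.
    """
    if x == 0.0:
        return 0.0
    # abs(x) == mantissa * 2**exponent with 0.5 <= mantissa < 1.
    mantissa, exponent = math.frexp(abs(x))
    # Largest power of two not exceeding |x|.
    low = math.ldexp(0.5, exponent)
    # (|x| - low) / low simplifies to 2 * mantissa - 1.
    p_up = 2.0 * mantissa - 1.0
    magnitude = 2.0 * low if rng.random() < p_up else low
    return math.copysign(magnitude, x)


def compress_vector(vec, rng=random):
    """Apply the rounding independently to every entry of an update vector."""
    return [natural_compression(v, rng) for v in vec]
```

Inputs that are already powers of two are returned unchanged, because the probability of rounding up is then zero.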

Distributed Methods with Absolute Compression and Error Compensation

The analysis of EC-SGD with absolute compression is generalized to the arbitrary sampling strategy, with rates that improve upon the previously known ones in this setting, and an analysis of EC-LSVRG with absolute compression for (strongly) convex problems is proposed.

Distributed Sparse SGD with Majority Voting

A novel majority voting based sparse communication strategy is introduced, in which the workers first seek a consensus on the structure of the sparse representation, which provides a significant reduction in the communication load and allows using the same sparsity level in both communication directions.

Sparse Communication for Training Deep Networks

This work studies several compression schemes and identifies how three key parameters affect the performance of synchronous stochastic gradient descent and introduces a simple sparsification scheme, random-block sparsifiers, that reduces communication while keeping the performance close to standard SGD.

Rethinking gradient sparsification as total error minimization

This work identifies that the total error (the sum of the compression errors for all iterations) encapsulates sparsification throughout training, and proposes a communication complexity model that minimizes the total error under a communication budget for the entire training.

Communication-Efficient Distributed Blockwise Momentum SGD with Error-Feedback

A general distributed compressed SGD with Nesterov's momentum is proposed, which achieves the same testing accuracy as momentum SGD using full-precision gradients, but with 46% less wall clock time.



Convex Optimization using Sparsified Stochastic Gradient Descent with Memory

A sparsification scheme for SGD where only a small constant number of coordinates are applied at each iteration, which outperforms QSGD in progress per number of bits sent and opens the path to using lock-free asynchronous parallelization on dense problems.

The Convergence of Sparsified Gradient Methods

It is proved that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees, for both convex and non-convex smooth objectives, for data-parallel SGD.

Local SGD Converges Fast and Communicates Little

Concise convergence rates are proved for local SGD on convex problems, and it is shown that it converges at the same rate as mini-batch SGD in terms of number of evaluated gradients, that is, the scheme achieves linear speedup in the number of workers and mini-batch size.

QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding

Quantized SGD is proposed, a family of compression schemes for gradient updates which provides convergence guarantees and leads to significant reductions in end-to-end training time, and can be extended to stochastic variance-reduced techniques.

Scalable distributed DNN training using commodity GPU cloud computing

It is shown empirically that the method can reduce the amount of communication by three orders of magnitude while training a typical DNN for acoustic modelling, and enables efficient scaling to more parallel GPU nodes than any other method the authors are aware of.

Scaling SGD Batch Size to 32K for ImageNet Training

Layer-wise Adaptive Rate Scaling (LARS) is proposed, a method that enables large-batch training for general networks or datasets, and it can scale the batch size to 32768 for ResNet50 and 8192 for AlexNet.

meProp: Sparsified Back Propagation for Accelerated Deep Learning with Reduced Overfitting

Surprisingly, experimental results demonstrate that the authors can update only 1-4% of the weights at each back propagation pass, and the accuracy of the resulting models is actually improved rather than degraded, and a detailed analysis is given.

Error Compensated Quantized SGD and its Applications to Large-scale Distributed Optimization

This paper proposes the error compensated quantized stochastic gradient descent algorithm to improve the training efficiency, and presents theoretical analysis on the convergence behaviour, and demonstrates its advantage over competitors.

Gradient Sparsification for Communication-Efficient Distributed Optimization

This paper proposes a convex optimization formulation to minimize the coding length of stochastic gradients and experiments on regularized logistic regression, support vector machines, and convolutional neural networks validate the proposed approaches.

TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning

This work mathematically proves the convergence of TernGrad under the assumption of a bound on gradients, and proposes layer-wise ternarizing and gradient clipping to improve its convergence.