Corpus ID: 7763588

signSGD: compressed optimisation for non-convex problems

Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, Anima Anandkumar
Training large neural networks requires distributing learning across multiple workers, where the cost of communicating gradients can be a significant bottleneck. signSGD alleviates this problem by transmitting just the sign of each minibatch stochastic gradient. We prove that it can get the best of both worlds: compressed gradients and SGD-level convergence rate. signSGD can exploit mismatches between L1 and L2 geometry: when noise and curvature are much sparser than the gradients, signSGD is… 
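The core update described in the abstract is simple enough to sketch in a few lines. The following is a minimal single-worker illustration; the learning rate and arrays are illustrative values, not from the paper:

```python
import numpy as np

def signsgd_step(params, grad, lr=0.1):
    """One signSGD update: move each parameter a fixed step in the
    direction of its gradient's sign, discarding the magnitude.
    Only sign(grad) would need to be communicated by a worker."""
    return params - lr * np.sign(grad)

params = np.array([1.0, -2.0, 0.5])
grad = np.array([0.3, -4.0, 0.0])   # note: sign(0.0) == 0, so no step there
new_params = signsgd_step(params, grad, lr=0.1)  # → [0.9, -1.9, 0.5]
```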

Figures and Tables from this paper

Sparse-SignSGD with Majority Vote for Communication-Efficient Distributed Learning

Experimental results using both independent and identically distributed (IID) and non-IID datasets demonstrate that ${\sf S}^3$GD-MV attains higher accuracy than signSGD while significantly reducing communication costs.

Stochastic Sign Descent Methods: New Algorithms and Better Theory

A new sign-based method is proposed, Stochastic Sign Descent with Momentum (SSDM), which converges under standard bounded variance assumption with the optimal asymptotic rate and is validated with numerical experiments.

Lossy Gradient Compression: How Much Accuracy Can One Bit Buy?

This work takes a rate-distortion approach to the compressor design problem for distributed training of deep neural networks (DNNs), and proposes a class of distortion measures to aid the design of quantizers for compressing the model updates.

Distributed Learning with Compressed Gradient Differences

This work proposes a new distributed learning method, DIANA, which resolves communication issues via compression of gradient differences; a theoretical analysis in the strongly convex and non-convex settings shows that its rates are superior to existing ones.

Quantizing data for distributed learning

The convergence of the proposed approach for smooth convex and non-convex objective functions is analyzed, and it is shown that it can achieve order-optimal convergence rates with communication that mostly depends on the data rather than the model (gradient) dimension.

Efficient-Adam: Communication-Efficient Distributed Adam with Complexity Analysis

This work introduces a novel communication-efficient distributed Adam in the parameter-server model for stochastic non-convex optimization, and incorporates a two-way quantization scheme into Efficient-Adam to reduce the communication cost between the workers and the server.

signSGD with Majority Vote is Communication Efficient And Byzantine Fault Tolerant

This work proposes a particularly simple algorithm for robust, communication-efficient learning, signSGD with majority vote, proves that unlike SGD, majority vote is robust even when up to 50% of workers behave adversarially, and builds the distributed training system in PyTorch.
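The majority-vote aggregation rule is easy to sketch. The following shows the server-side step only, with worker/network plumbing omitted and all names illustrative:

```python
import numpy as np

def majority_vote_update(worker_grads, lr=0.1):
    """Each worker would transmit only sign(grad); the server takes
    the elementwise majority sign and broadcasts the 1-bit result."""
    signs = np.sign(worker_grads)        # shape: (n_workers, dim)
    vote = np.sign(signs.sum(axis=0))    # elementwise majority
    return -lr * vote                    # update to apply to params

grads = np.array([[ 1.0, -2.0],
                  [ 0.5,  3.0],
                  [-0.1, -0.2]])
update = majority_vote_update(grads, lr=0.1)  # majority signs [+1, -1] → [-0.1, 0.1]
```

A Byzantine worker can flip at most its own vote, which is why a majority of honest workers suffices for robustness.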

On Large-Batch Training of Residual Networks with SignSGD

Large-batch training of deep neural networks (DNNs) has recently been widely studied, since traversing the optimization landscape is faster with large batches and with the emergence of parallel computing…

Quantized Compressive Sampling of Stochastic Gradients for Efficient Communication in Distributed Deep Learning

Quantized Compressive Sampling (QCS) of stochastic gradients is proposed, addressing the above two issues while achieving an arbitrarily large compression gain; a method is also developed and analyzed that both controls the overall variance of the compressed stochastic gradients and prevents staleness of the updates.

Maximizing Communication Efficiency for Large-scale Training via 0/1 Adam

The non-linearity in Adam causes slow convergence even when 1-bit compression or local steps are applied individually, so 0/1 Adam is proposed, which linearizes each Adam step by approximating its optimizer states using stale estimates and linear correlation.

TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning

This work mathematically proves the convergence of TernGrad under the assumption of a bound on gradients, and proposes layer-wise ternarizing and gradient clipping to improve its convergence.
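The ternarizing step can be approximated with a short sketch; the probability rule below follows the unbiased stochastic scheme TernGrad is built on, with all names and values illustrative:

```python
import numpy as np

def ternarize(grad, rng):
    """Stochastic ternarization sketch: each component becomes
    s * {-1, 0, +1} with s = max|grad|, keeping the sign with
    probability |g_i| / s so the result is unbiased in expectation."""
    s = np.max(np.abs(grad))
    if s == 0.0:
        return np.zeros_like(grad)
    keep = rng.random(grad.shape) < np.abs(grad) / s
    return s * np.sign(grad) * keep
```

Layer-wise application (one scale s per layer) and gradient clipping, as proposed in the paper, shrink s and so reduce the variance this scheme introduces.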

Adam: A Method for Stochastic Optimization

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
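Adam's moment estimates and bias correction fit in a few lines; this sketch uses the paper's default hyperparameters, with the functional step-by-step shape chosen for illustration:

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step: exponential moving averages of the gradient (m)
    and its elementwise square (v), bias-corrected, scale the update."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad**2
    m_hat = m / (1 - b1**t)              # bias-corrected first moment
    v_hat = v / (1 - b2**t)              # bias-corrected second moment
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```

On the very first step (t=1, m=v=0) the update reduces to roughly lr * sign(grad), one reason Adam is often discussed alongside sign-based methods such as signSGD.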

QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding

Quantized SGD is proposed, a family of compression schemes for gradient updates which provides convergence guarantees and leads to significant reductions in end-to-end training time, and can be extended to stochastic variance-reduced techniques.
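The unbiased stochastic rounding at the heart of the scheme can be sketched as follows; the 2-norm scaling follows the basic variant of the idea, and the function name and parameters are illustrative:

```python
import numpy as np

def qsgd_quantize(v, s, rng):
    """QSGD-style sketch: scale each |v_i| by s / ||v||_2, then round
    to an adjacent integer level randomly so the result is unbiased;
    only the norm, signs, and small integer levels need transmitting."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return np.zeros_like(v)
    ratio = np.abs(v) / norm * s           # lies in [0, s]
    lower = np.floor(ratio)
    prob_up = ratio - lower                # round up with this probability
    level = lower + (rng.random(v.shape) < prob_up)
    return norm * np.sign(v) * level / s
```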

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

This paper empirically shows that on the ImageNet dataset large minibatches cause optimization difficulties, but that when these are addressed the trained networks exhibit good generalization, enabling training of visual recognition models on internet-scale data with high efficiency.

A Bayesian Perspective on Generalization and Stochastic Gradient Descent

It is proposed that the noise introduced by small mini-batches drives the parameters towards minima whose evidence is large, and it is demonstrated that, when one holds the learning rate fixed, there is an optimum batch size which maximizes the test set accuracy.

The Marginal Value of Adaptive Gradient Methods in Machine Learning

It is observed that the solutions found by adaptive methods generalize worse (often significantly worse) than SGD, even when these solutions have better training performance, suggesting that practitioners should reconsider the use of adaptive methods to train neural networks.

Scalable distributed DNN training using commodity GPU cloud computing

It is shown empirically that the method can reduce the amount of communication by three orders of magnitude while training a typical DNN for acoustic modelling, and that it enables efficient scaling to more parallel GPU nodes than any other method the authors are aware of.

Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients

This analysis extends recent results on the adverse effects of ADAM on generalization, isolating the sign aspect as the problematic one; transferring the variance adaptation to SGD gives rise to a novel method, completing the practitioner's toolbox for problems where ADAM fails.

1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs

This work shows empirically that in SGD training of deep neural networks one can quantize the gradients aggressively, to just one bit per value, at no or nearly no loss of accuracy, provided the quantization error is carried forward across minibatches (error feedback); combining this finding with AdaGrad yields a data-parallel, deterministically distributed SGD implementation.
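The error-feedback idea can be sketched in a few lines. The scale choice below (mean absolute value of the corrected gradient) is an illustrative reconstruction value, not necessarily the paper's exact quantizer:

```python
import numpy as np

def one_bit_step(grad, residual, lr=0.1):
    """Error-feedback sketch: quantize (grad + carried-over residual)
    to its sign times one shared magnitude (1 bit per value plus one
    scalar), and carry the quantization error to the next minibatch."""
    corrected = grad + residual
    scale = np.mean(np.abs(corrected))       # one shared magnitude
    quantized = scale * np.sign(corrected)
    new_residual = corrected - quantized     # error fed back next step
    return -lr * quantized, new_residual
```

Because the residual re-enters the next step, quantization error is corrected over time rather than lost, which is what permits such aggressive compression.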

Fixing Weight Decay Regularization in Adam

This work decouples the optimal choice of the weight decay factor from the setting of the learning rate, for both standard SGD and Adam, substantially improving Adam's generalization performance and allowing it to compete with SGD with momentum on image classification datasets.