Corpus ID: 210701015

Elastic Consistency: A General Consistency Model for Distributed Stochastic Gradient Descent

@article{Alistarh2020ElasticCA,
  title={Elastic Consistency: A General Consistency Model for Distributed Stochastic Gradient Descent},
  author={Dan Alistarh and Bapi Chatterjee and Vyacheslav Kungurtsev},
  journal={ArXiv},
  year={2020},
  volume={abs/2001.05918}
}
Machine learning has made tremendous progress in recent years, with models matching or even surpassing humans on a series of specialized tasks. One key element behind this progress has been the ability to train machine learning models in large-scale distributed shared-memory and message-passing environments. Many of these models are trained using variants of stochastic gradient descent (SGD) based optimization. In this paper, we introduce a general…
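To make the setting concrete, the following is a minimal sketch of data-parallel SGD in which each simulated worker computes its gradient at a possibly stale view of the shared parameters; the staleness mechanism, objective, and constants are illustrative assumptions, not the paper's exact model.

```python
# Minimal sketch (not the paper's exact model): data-parallel SGD on a least-squares
# objective, where each simulated worker computes gradients at a possibly stale view
# of the parameters. Consistency analyses of this kind bound how far such a view may
# drift from the shared iterate; here the staleness is simply simulated.
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(256, 10))                     # synthetic data
b = A @ rng.normal(size=10) + 0.1 * rng.normal(size=256)

def stochastic_grad(x, batch=32):
    """Minibatch gradient of 0.5 * ||Ax - b||^2 averaged over the batch."""
    idx = rng.integers(0, A.shape[0], size=batch)
    Ai, bi = A[idx], b[idx]
    return Ai.T @ (Ai @ x - bi) / batch

x = np.zeros(10)                                   # shared parameter vector
views = [x.copy() for _ in range(4)]               # each worker's (possibly stale) view
lr, refresh_prob = 0.01, 0.5                       # illustrative constants

for step in range(2000):
    w = step % 4                                   # round-robin over simulated workers
    g = stochastic_grad(views[w])                  # gradient computed at the stale view
    x -= lr * g                                    # update applied to the shared iterate
    if rng.random() < refresh_prob:                # views refresh only occasionally, so
        views[w] = x.copy()                        # ||x - view|| stays bounded but nonzero

print("final loss:", 0.5 * np.mean((A @ x - b) ** 2))
```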

Citations

MixML: A Unified Analysis of Weakly Consistent Parallel Learning
TLDR: MixML is proposed, a general framework for analyzing convergence of weakly consistent parallel machine learning that recovers and improves on known convergence bounds for asynchronous and/or decentralized versions of many algorithms, including SGD and AMSGrad.
Scaling the Wild: Decentralizing Hogwild!-style Shared-memory SGD
TLDR: This paper proposes an algorithm that combines a decentralized distributed-memory architecture with multiprocessing parallel shared-memory SGD on each node, and proves that the method guarantees ergodic convergence rates for non-convex objectives.
Elastic Consistency: A Practical Consistency Model for Distributed Stochastic Gradient Descent
TLDR: A new synchronization-avoiding scheme for distributed SGD is proposed and analyzed, and it is shown that it can be used to efficiently train deep convolutional models for image classification.
Asynchronous Optimization Methods for Efficient Training of Deep Neural Networks with Guarantees
TLDR: A new implementation strategy for shared-memory training of deep neural networks is proposed, in which concurrent parameter servers train a partitioned but shared model in single- and multi-GPU settings without compromising accuracy.
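As a rough illustration of the "partitioned but shared model" idea, the sketch below splits the parameters into shards, each guarded by its own lock, so updates to different shards can proceed concurrently. The shard layout, objective, and constants are assumptions for illustration, not the paper's implementation.

```python
# Hedged sketch: one lock per parameter shard instead of a single global lock.
import threading
import numpy as np

DIM, SHARDS, WORKERS, STEPS, LR = 12, 3, 4, 500, 0.05
params = np.zeros(DIM)                               # the shared model
shard_idx = np.array_split(np.arange(DIM), SHARDS)   # partition of the coordinates
locks = [threading.Lock() for _ in range(SHARDS)]    # one lock per shard
target = np.arange(DIM, dtype=float)                 # toy objective: 0.5 * ||x - target||^2

def worker(seed):
    rng = np.random.default_rng(seed)
    for _ in range(STEPS):
        snapshot = params.copy()                     # read a (possibly inconsistent) snapshot
        grad = snapshot - target + 0.01 * rng.normal(size=DIM)
        for s, idx in enumerate(shard_idx):          # push the update shard by shard
            with locks[s]:
                params[idx] -= LR * grad[idx]

threads = [threading.Thread(target=worker, args=(s,)) for s in range(WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("distance to optimum:", np.linalg.norm(params - target))
```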
Consistent Lock-free Parallel Stochastic Gradient Descent for Fast and Stable Convergence
TLDR: Leashed-SGD is proposed, an extensible algorithmic framework of consistency-preserving, lock-free implementations of AsyncSGD that effectively balances throughput and latency, and features a natural contention-regulating mechanism as well as dynamic memory management that allocates space only when needed.
Optimal Complexity in Decentralized Training
TLDR: DeTAG is proposed, a practical gossip-style decentralized algorithm that achieves the lower bound with only a logarithmic gap, and it is shown that DeTAG enjoys faster convergence than baselines, especially on unshuffled data and in sparse networks.
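For readers unfamiliar with gossip-style decentralized training, here is a minimal sketch on a ring topology with a fixed mixing matrix. It is a generic illustration of gossip averaging, not DeTAG itself, and all constants are assumptions.

```python
# Generic gossip-averaging sketch: each node takes a local SGD step, then averages
# its parameters with its ring neighbours via a doubly-stochastic mixing matrix.
import numpy as np

rng = np.random.default_rng(0)
NODES, DIM, STEPS, LR = 5, 8, 300, 0.05
target = rng.normal(size=DIM)                 # common toy objective: 0.5 * ||x - target||^2

# Ring mixing matrix: 1/3 weight on self and on each neighbour.
W = np.zeros((NODES, NODES))
for i in range(NODES):
    W[i, i] = W[i, (i - 1) % NODES] = W[i, (i + 1) % NODES] = 1 / 3

X = np.zeros((NODES, DIM))                    # one parameter vector per node
for _ in range(STEPS):
    grads = X - target + 0.1 * rng.normal(size=X.shape)   # noisy local gradients
    X = W @ (X - LR * grads)                  # local step followed by gossip averaging

print("consensus error   :", np.linalg.norm(X - X.mean(axis=0)))
print("distance to optimum:", np.linalg.norm(X.mean(axis=0) - target))
```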
Project CGX: Algorithmic and System Support for Scalable Deep Learning on a Budget
TLDR: This paper investigates whether the expensive hardware overprovisioning approach can be supplanted via algorithmic and system design, and proposes a framework called CGX that provides efficient software support for communication compression and removes communication bottlenecks from consumer-grade multi-GPU systems in the absence of hardware support.
Project CGX: Scalable Deep Learning on Commodity GPUs
TLDR: This paper investigates whether the expensive hardware overprovisioning approach can be supplanted via algorithmic and system design, and proposes a framework called CGX that provides efficient software support for communication compression and removes communication bottlenecks from consumer-grade multi-GPU systems in the absence of hardware support.
Towards Optimal Convergence Rate in Decentralized Stochastic Training
TLDR: A tight lower bound on the iteration complexity for such methods in a stochastic non-convex setting is provided, and DeFacto is proposed, a class of algorithms that converge at the optimal rate without additional theoretical assumptions.

References

Showing 1–10 of 69 references
A generic communication scheduler for distributed DNN training acceleration
TLDR: This work introduces a unified abstraction and a Dependency Proxy mechanism to enable communication scheduling without breaking the original dependencies in framework engines, and introduces a Bayesian Optimization approach to auto-tune tensor partition size and other parameters for different training models under various networking conditions.
The Error-Feedback Framework: Better Rates for SGD with Delayed Gradients and Compressed Communication
TLDR: These results show that SGD is robust to compressed and/or delayed stochastic gradient updates, which is particularly important for distributed parallel implementations, where asynchronous and communication-efficient methods are key to achieving linear speedups for optimization with multiple devices.
Taming the Wild: A Unified Analysis of Hogwild-Style Algorithms
TLDR: This work uses a martingale-based analysis to derive convergence rates for the convex case (Hogwild!) with relaxed assumptions on the sparsity of the problem, and designs and analyzes an asynchronous SGD algorithm, called Buckwild!, that uses lower-precision arithmetic.
Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
TLDR: This work shows, using novel theoretical analysis, algorithms, and implementation, that SGD can be implemented without any locking, and presents an update scheme called HOGWILD! in which processors access shared memory with the possibility of overwriting each other's work.
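A toy sketch of this lock-free update pattern follows, assuming sparse per-sample supports so that concurrent writes rarely collide; it is illustrative only, not the authors' code.

```python
# Illustrative lock-free loop in the spirit of HOGWILD!: threads read and write a
# shared parameter vector without locking, relying on sparsity to keep overwrites rare.
import threading
import numpy as np

DIM, WORKERS, STEPS, LR = 1000, 4, 5000, 0.1
params = np.zeros(DIM)                     # shared state, accessed without locks
target = np.ones(DIM)                      # toy objective: 0.5 * ||x - target||^2

def worker(seed):
    rng = np.random.default_rng(seed)
    for _ in range(STEPS):
        idx = rng.integers(0, DIM, size=5)           # sparse support of this sample
        grad = params[idx] - target[idx]             # gradient on those coordinates only
        params[idx] -= LR * grad                     # unsynchronized componentwise write

threads = [threading.Thread(target=worker, args=(s,)) for s in range(WORKERS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("max coordinate error:", np.abs(params - target).max())
```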
Error Feedback Fixes SignSGD and other Gradient Compression Schemes
TLDR: It is proved that the EF-SGD algorithm with an arbitrary compression operator achieves the same rate of convergence as SGD without any additional assumptions, and thus EF-SGD achieves gradient compression for free.
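The error-feedback mechanism referred to here can be sketched in a few lines: whatever the compressor discards is kept in a local memory and added back before the next compression step. The sign compressor, objective, and constants below are illustrative assumptions.

```python
# Hedged sketch of error feedback with a scaled-sign compressor.
import numpy as np

rng = np.random.default_rng(0)
DIM, STEPS, LR = 20, 500, 0.05
target = rng.normal(size=DIM)              # toy objective: 0.5 * ||x - target||^2

def compress(v):
    """signSGD-style compressor: keep only the signs, rescaled by the mean magnitude."""
    return np.sign(v) * np.abs(v).mean()

x = np.zeros(DIM)
memory = np.zeros(DIM)                            # residual (error-feedback) memory
for _ in range(STEPS):
    g = x - target + 0.1 * rng.normal(size=DIM)   # stochastic gradient
    corrected = LR * g + memory                   # add back previously dropped error
    update = compress(corrected)
    memory = corrected - update                   # remember what the compressor discarded
    x -= update

print("distance to optimum:", np.linalg.norm(x - target))
```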
1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs
TLDR: This work shows empirically that in SGD training of deep neural networks one can quantize the gradients aggressively, to just one bit per value, at little or no loss of accuracy, provided the quantization error is carried forward across minibatches (error feedback), and implements data-parallel deterministically distributed SGD by combining this finding with AdaGrad.
Distributed Computing: Fundamentals, Simulations and Advanced Topics
M. Paprzycki • Computer Science • Scalable Comput. Pract. Exp. • 2001
TLDR: Stephen J. Hartley first provides a complete explanation of the features of Java necessary to write concurrent programs, including topics such as exception handling, interfaces, and packages, and takes a different approach than most Java references.
The Convergence of Sparsified Gradient Methods
TLDR: It is proved that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees, for both convex and non-convex smooth objectives, for data-parallel SGD.
The Convergence of Stochastic Gradient Descent in Asynchronous Shared Memory
TLDR: This work provides new convergence bounds for lock-free concurrent stochastic gradient descent, executing in the classic asynchronous shared-memory model against a strong adaptive adversary, and shows that this classic optimization tool can converge faster and with a wider range of parameters than previously known under asynchronous iterations.
QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding
TLDR: Quantized SGD is proposed, a family of compression schemes for gradient updates which provides convergence guarantees, leads to significant reductions in end-to-end training time, and can be extended to stochastic variance-reduced techniques.
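A minimal sketch of the stochastic quantization at the core of QSGD follows. The number of levels and the unbiasedness check are illustrative; real implementations additionally entropy-code the quantized values before transmission.

```python
# Unbiased stochastic quantization onto s levels per sign, QSGD-style.
import numpy as np

rng = np.random.default_rng(0)

def qsgd_quantize(v, s=4):
    """Return an unbiased stochastic quantization of v onto s levels per sign."""
    norm = np.linalg.norm(v)
    if norm == 0:
        return np.zeros_like(v)
    level = np.abs(v) / norm * s              # position in [0, s]
    lower = np.floor(level)
    prob_up = level - lower                   # rounding up with this probability is unbiased
    q = lower + (rng.random(v.shape) < prob_up)
    return norm * np.sign(v) * q / s

g = rng.normal(size=10)
samples = np.mean([qsgd_quantize(g) for _ in range(5000)], axis=0)
print("original    :", np.round(g, 3))
print("mean of Q(g):", np.round(samples, 3))   # close to g, illustrating unbiasedness
```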