Elastic Consistency: A General Consistency Model for Distributed Stochastic Gradient Descent
@article{Alistarh2020ElasticCA,
  title   = {Elastic Consistency: A General Consistency Model for Distributed Stochastic Gradient Descent},
  author  = {Dan Alistarh and Bapi Chatterjee and Vyacheslav Kungurtsev},
  journal = {ArXiv},
  year    = {2020},
  volume  = {abs/2001.05918}
}
Machine learning has made tremendous progress in recent years, with models matching or even surpassing humans on a series of specialized tasks. A key element behind this progress has been the ability to train machine learning models in large-scale distributed shared-memory and message-passing environments. Many of these models are trained using variants of stochastic gradient descent (SGD)-based optimization.
In this paper, we introduce a general…
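The abstract is cut off above, but the idea named in the title, a consistency model that bounds how far the parameter view used for a gradient step may drift from the true iterate, can be pictured with a small simulation. The sketch below is only an illustration under assumptions made for this example (a fixed staleness bound `tau` and a simple quadratic objective); it is not the paper's formal elastic consistency definition.

```python
# Illustrative sketch (not the paper's formal definition): simulate SGD steps
# computed from a stale parameter view lagging by at most `tau` updates, and
# measure the drift ||x_t - v_t|| between the true iterate and the view used,
# which is the kind of quantity a consistency model like this bounds.
import numpy as np

rng = np.random.default_rng(0)
dim, steps, tau, lr = 10, 200, 5, 0.05
A = rng.normal(size=(dim, dim))
A = A @ A.T / dim                                        # simple quadratic objective 0.5 x^T A x

def grad(x):
    return A @ x                                         # gradient of 0.5 * x^T A x

x = rng.normal(size=dim)
history = [x.copy()]
max_drift = 0.0
for t in range(steps):
    lag = rng.integers(0, min(tau, len(history)))        # staleness of the view
    v = history[-1 - lag]                                # stale view v_t actually read
    max_drift = max(max_drift, np.linalg.norm(x - v))    # drift a consistency bound would control
    x = x - lr * grad(v)                                 # SGD step taken on the stale view
    history.append(x.copy())

print(f"final loss {0.5 * x @ A @ x:.4f}, max drift ||x_t - v_t|| = {max_drift:.4f}")
```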
9 Citations
MixML: A Unified Analysis of Weakly Consistent Parallel Learning
- Computer Science, ArXiv
- 2020
MixML is proposed, a general framework for analyzing convergence of weakly consistent parallel machine learning that recovers and improves on known convergence bounds for asynchronous and/or decentralized versions of many algorithms, including SGD and AMSGrad.
Scaling the Wild: Decentralizing Hogwild!-style Shared-memory SGD
- Computer Science, ArXiv
- 2022
This paper proposes an algorithm that combines a decentralized distributed-memory architecture with each node itself running multi-process parallel shared-memory SGD, and proves that the method guarantees ergodic convergence rates for non-convex objectives.
Elastic Consistency: A Practical Consistency Model for Distributed Stochastic Gradient Descent
- Computer Science, AAAI
- 2021
A new synchronization-avoiding scheme for distributed SGD is proposed and analyzed, and is shown to efficiently train deep convolutional models for image classification.
Asynchronous Optimization Methods for Efficient Training of Deep Neural Networks with Guarantees
- Computer Science, AAAI
- 2021
A new implementation strategy for shared-memory training of deep neural networks is proposed, whereby concurrent parameter servers are used to train a partitioned but shared model in single- and multi-GPU settings without compromising accuracy.
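As a rough picture of the partitioned parameter-server idea summarized above, the sketch below splits the model into shards, each owned by a server thread that applies gradient slices pushed by worker threads. The sharding, queue-based protocol, and toy objective are assumptions for this example, not the paper's implementation.

```python
# Rough illustration (assumed design, not the paper's implementation): the model
# is split into shards, each shard owned by a server thread that applies the
# gradient slices that worker threads push asynchronously.
import threading, queue
import numpy as np

DIM, SHARDS, WORKERS, STEPS, LR = 8, 2, 2, 50, 0.1
model = np.random.default_rng(1).normal(size=DIM)        # shared model, partitioned into slices
slices = np.array_split(np.arange(DIM), SHARDS)
inboxes = [queue.Queue() for _ in range(SHARDS)]

def server(shard_id):
    idx = slices[shard_id]
    while True:
        g = inboxes[shard_id].get()
        if g is None:                                    # shutdown signal
            return
        model[idx] -= LR * g                             # apply the gradient slice for this shard

def worker():
    for _ in range(STEPS):
        x = model.copy()                                 # read the (possibly stale) assembled model
        g = 2.0 * x                                      # gradient of the toy objective ||x||^2
        for s, idx in enumerate(slices):
            inboxes[s].put(g[idx])                       # push the per-shard gradient slice

servers = [threading.Thread(target=server, args=(s,)) for s in range(SHARDS)]
workers = [threading.Thread(target=worker) for _ in range(WORKERS)]
for t in servers + workers: t.start()
for t in workers: t.join()
for q in inboxes: q.put(None)
for t in servers: t.join()
print("final norm:", np.linalg.norm(model))
```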
Consistent Lock-free Parallel Stochastic Gradient Descent for Fast and Stable Convergence
- Computer Science, 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
- 2021
Leashed-SGD is proposed, an extensible algorithmic framework of consistency-preserving AsyncSGD implementations that employs lock-free synchronization to balance throughput and latency, and features a natural contention-regulating mechanism as well as dynamic memory management that allocates space only when needed.
Optimal Complexity in Decentralized Training
- Computer Science, ICML
- 2021
DeTAG is proposed, a practical gossip-style decentralized algorithm that achieves the lower bound up to only a logarithmic gap, and DeTAG is shown to enjoy faster convergence than baselines, especially on unshuffled data and in sparse networks.
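The gossip step underlying algorithms of this family can be sketched as follows: each node mixes its parameters with its neighbors through a doubly stochastic matrix and then takes a local stochastic gradient step. This is a generic decentralized SGD sketch, not DeTAG itself; the ring topology, mixing weights, and toy objective are assumptions for the example.

```python
# Generic decentralized/gossip SGD step (not DeTAG itself): each node averages
# its parameters with its ring neighbors via a doubly stochastic mixing matrix,
# then takes a local stochastic gradient step.
import numpy as np

rng = np.random.default_rng(0)
nodes, dim, lr, rounds = 4, 5, 0.1, 100
X = rng.normal(size=(nodes, dim))                        # one parameter row per node

# Ring topology: 1/2 self-weight, 1/4 to each neighbor (doubly stochastic).
W = np.zeros((nodes, nodes))
for i in range(nodes):
    W[i, i] = 0.5
    W[i, (i - 1) % nodes] = 0.25
    W[i, (i + 1) % nodes] = 0.25

def local_grad(x):
    return x + 0.1 * rng.normal(size=x.shape)            # noisy gradient of 0.5 * ||x||^2

for _ in range(rounds):
    X = W @ X                                            # gossip: average with neighbors
    X = X - lr * np.array([local_grad(X[i]) for i in range(nodes)])

print("consensus distance:", np.linalg.norm(X - X.mean(axis=0)))
print("mean iterate norm:", np.linalg.norm(X.mean(axis=0)))
```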
Project CGX: Algorithmic and System Support for Scalable Deep Learning on a Budget
- Computer Science
- 2021
This paper investigates whether expensive hardware overprovisioning can be supplanted via algorithmic and system design, and proposes CGX, a framework that provides efficient software support for communication compression; the framework is shown to remove communication bottlenecks from consumer-grade multi-GPU systems in the absence of hardware support.
Project CGX: Scalable Deep Learning on Commodity GPUs
- Computer Science, ArXiv
- 2021
This paper investigates whether expensive hardware overprovisioning can be supplanted via algorithmic and system design, and proposes CGX, a framework that provides efficient software support for communication compression and is able to remove communication bottlenecks from consumer-grade multi-GPU systems in the absence of hardware support.
Towards Optimal Convergence Rate in Decentralized Stochastic Training
- Computer ScienceArXiv
- 2020
A tight lower bound is provided on the iteration complexity of decentralized stochastic training methods in the non-convex setting, and DeFacto is proposed, a class of algorithms that converge at the optimal rate without additional theoretical assumptions.
References
Showing 1-10 of 69 references
A generic communication scheduler for distributed DNN training acceleration
- Computer Science, SOSP
- 2019
This work introduces a unified abstraction and a Dependency Proxy mechanism to enable communication scheduling without breaking the original dependencies in framework engines, together with a Bayesian Optimization approach to auto-tune tensor partition sizes and other parameters for different training models under various networking conditions.
The Error-Feedback Framework: Better Rates for SGD with Delayed Gradients and Compressed Communication
- Computer Science, ArXiv
- 2019
These results show that SGD is robust to compressed and/or delayed stochastic gradient updates, which is particularly important for distributed parallel implementations, where asynchronous and communication-efficient methods are key to achieving linear speedups for optimization with multiple devices.
Taming the Wild: A Unified Analysis of Hogwild-Style Algorithms
- Computer Science, NIPS
- 2015
This work uses a martingale-based analysis to derive convergence rates for the convex case (Hogwild!) under relaxed assumptions on problem sparsity, and designs and analyzes Buckwild!, an asynchronous SGD algorithm that uses lower-precision arithmetic.
Hogwild: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent
- Computer Science, NIPS
- 2011
This work shows, using novel theoretical analysis, algorithms, and implementation, that SGD can be run without any locking, and presents an update scheme called HOGWILD! which allows processors to access shared memory with the possibility of overwriting each other's work.
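The HOGWILD!-style access pattern can be sketched in a few lines: several threads apply sparse updates to a shared parameter vector without any locking. This is only an illustration (Python's GIL means it mimics the access pattern rather than demonstrating real parallel speedup), with a toy sparse least-squares problem assumed for the example.

```python
# Illustrative HOGWILD!-style loop: several threads update a shared parameter
# vector without locks; each sample touches only a few coordinates, which is
# the sparsity that makes unsynchronized overwrites mostly harmless.
# (Python's GIL means this only mimics the access pattern, not parallel speedup.)
import threading
import numpy as np

rng = np.random.default_rng(0)
dim, samples, lr = 100, 2000, 0.05
w = np.zeros(dim)                                        # shared parameters, no lock

# Sparse least-squares data: each sample involves only 3 coordinates.
idxs = rng.integers(0, dim, size=(samples, 3))
vals = rng.normal(size=(samples, 3))
ys = rng.normal(size=samples)

def worker(sample_ids):
    for s in sample_ids:
        i, v = idxs[s], vals[s]
        err = w[i] @ v - ys[s]                           # read shared coords (possibly stale)
        w[i] -= lr * err * v                             # unsynchronized sparse write

chunks = np.array_split(np.arange(samples), 4)
threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
for t in threads: t.start()
for t in threads: t.join()

loss = np.mean([(w[idxs[s]] @ vals[s] - ys[s]) ** 2 for s in range(samples)])
print("mean squared error:", loss)
```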
Error Feedback Fixes SignSGD and other Gradient Compression Schemes
- Computer Science, ICML
- 2019
It is proved that the algorithm EF-SGD with an arbitrary compression operator achieves the same rate of convergence as SGD without any additional assumptions, and thus EF-SGD achieves gradient compression for free.
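The error-feedback pattern referenced here (and in the 1-bit SGD and sparsification entries nearby) fits in a few lines: compress the gradient plus a residual memory, apply the compressed update, and carry the compression error into the next step. The sketch below uses scaled sign compression and a toy objective as assumptions; it is not the authors' exact pseudocode.

```python
# Error-feedback pattern (illustrative, not the authors' exact pseudocode):
# compress (step + residual memory), apply the compressed update, and carry
# the compression error forward into the next step.
import numpy as np

rng = np.random.default_rng(0)
dim, steps, lr = 20, 500, 0.05
x = rng.normal(size=dim)
memory = np.zeros(dim)                                   # accumulated compression error

def grad(x):
    return x + 0.1 * rng.normal(size=dim)                # noisy gradient of 0.5 * ||x||^2

def compress(v):
    return np.sign(v) * np.mean(np.abs(v))               # scaled sign compression operator

for _ in range(steps):
    corrected = lr * grad(x) + memory                    # add back the past error
    update = compress(corrected)
    memory = corrected - update                          # error fed into the next step
    x = x - update

print("final norm:", np.linalg.norm(x))
```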
1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs
- Computer Science, INTERSPEECH
- 2014
This work shows empirically that in SGD training of deep neural networks, one can quantize the gradients aggressively, to as little as one bit per value, at little or no loss of accuracy, provided the quantization error is carried forward across minibatches (error feedback); combining this finding with AdaGrad yields a data-parallel, deterministically distributed SGD implementation.
Distributed Computing: Fundamentals, Simulations and Advanced Topics
- Computer Science, Scalable Comput. Pract. Exp.
- 2001
A textbook by Hagit Attiya and Jennifer Welch covering the foundations of distributed computing, including the message-passing and shared-memory models, simulations between them, and advanced topics such as consensus, synchronization, and fault tolerance.
The Convergence of Sparsified Gradient Methods
- Computer Science, NeurIPS
- 2018
It is proved that, under analytic assumptions, sparsifying gradients by magnitude with local error correction provides convergence guarantees, for both convex and non-convex smooth objectives, for data-parallel SGD.
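A magnitude-based variant of the same error-correction idea can be sketched for the data-parallel case: each worker sends only its k largest-magnitude gradient coordinates and keeps the remainder as a local residual. The worker count, k, and toy objective below are assumptions for illustration, not the paper's exact algorithm.

```python
# Illustrative data-parallel top-k sparsification with local error correction
# (assumed setup, not the paper's exact algorithm): each worker sends only its
# k largest-magnitude gradient coordinates and keeps the rest as a residual.
import numpy as np

rng = np.random.default_rng(0)
workers, dim, k, steps, lr = 4, 50, 5, 300, 0.1
x = rng.normal(size=dim)
residual = np.zeros((workers, dim))                      # one error accumulator per worker

def local_grad(x):
    return x + 0.1 * rng.normal(size=dim)                # noisy gradient of 0.5 * ||x||^2

def top_k(v, k):
    out = np.zeros_like(v)
    idx = np.argsort(np.abs(v))[-k:]                     # k largest-magnitude coordinates
    out[idx] = v[idx]
    return out

for _ in range(steps):
    total = np.zeros(dim)
    for w in range(workers):
        acc = local_grad(x) + residual[w]                # gradient plus carried-over error
        sparse = top_k(acc, k)
        residual[w] = acc - sparse                       # keep what was not sent
        total += sparse
    x = x - lr * total / workers                         # apply the averaged sparse update

print("final norm:", np.linalg.norm(x))
```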
The Convergence of Stochastic Gradient Descent in Asynchronous Shared Memory
- Computer Science, PODC
- 2018
This work provides new convergence bounds for lock-free concurrent stochastic gradient descent, executing in the classic asynchronous shared memory model, against a strong adaptive adversary, and shows that this classic optimization tool can converge faster and with a wider range of parameters than previously known under asynchronous iterations.
QSGD: Communication-Efficient SGD via Gradient Quantization and Encoding
- Computer Science, NIPS
- 2017
Quantized SGD (QSGD) is proposed, a family of compression schemes for gradient updates that provides convergence guarantees, leads to significant reductions in end-to-end training time, and can be extended to stochastic variance-reduced techniques.
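The quantization step can be illustrated with a small stochastic rounding routine in the spirit of QSGD: normalize by the gradient norm and round each coordinate to one of s levels with probabilities that keep the estimate unbiased. The encoding and the variance/communication trade-off analysis from the paper are omitted; the level count s=4 below is an arbitrary choice for the example.

```python
# Illustrative QSGD-style stochastic quantizer: scale by the gradient norm and
# round each coordinate up or down to one of `s` levels with a probability that
# keeps the quantized vector unbiased. Encoding details are omitted.
import numpy as np

rng = np.random.default_rng(0)

def qsgd_quantize(v, s=4):
    norm = np.linalg.norm(v)
    if norm == 0:
        return np.zeros_like(v)
    level = np.abs(v) / norm * s                         # position in [0, s]
    lower = np.floor(level)
    prob = level - lower                                 # round up with this probability
    rounded = lower + (rng.random(v.shape) < prob)
    return np.sign(v) * norm * rounded / s               # unbiased: E[q(v)] = v

g = rng.normal(size=1000)
q = qsgd_quantize(g, s=4)
print("relative error:", np.linalg.norm(q - g) / np.linalg.norm(g))
print("nonzero fraction:", np.mean(q != 0))
```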