Corpus ID: 198953329

Taming Momentum in a Distributed Asynchronous Environment

@article{Hakimi2019TamingMI,
  title={Taming Momentum in a Distributed Asynchronous Environment},
  author={Ido Hakimi and Saar Barkai and Moshe Gabel and A. Schuster},
  journal={ArXiv},
  year={2019},
  volume={abs/1907.11612}
}
Although distributed computing can significantly reduce the training time of deep neural networks, scaling the training process while maintaining high efficiency and final accuracy is challenging. Distributed asynchronous training enjoys near-linear speedup, but asynchrony causes gradient staleness, the main difficulty in scaling stochastic gradient descent to large clusters. Momentum, which is often used to accelerate convergence and escape local minima, exacerbates the gradient staleness…
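To make the staleness problem concrete, the following is a minimal Python sketch (illustrative only, not the paper's algorithm): a momentum update is applied to the master parameters using a gradient that was computed several steps earlier, so the momentum buffer keeps accumulating directions that are already out of date.

import numpy as np

# Toy objective f(w) = 0.5 * ||w||^2, so grad(w) = w.
def grad(w):
    return w

w = np.ones(4)           # master parameters
v = np.zeros_like(w)     # momentum buffer
lr, mu = 0.1, 0.9
staleness = 4            # updates applied since the worker read the parameters

history = [w.copy()]
for t in range(20):
    w_read = history[max(0, len(history) - 1 - staleness)]  # stale snapshot
    g = grad(w_read)          # gradient of an old point, not the current one
    v = mu * v + g            # momentum keeps amplifying stale directions
    w = w - lr * v
    history.append(w.copy())

print(w)  # with large mu and large staleness the iterates can oscillate or diverge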
Citations

Gap Aware Mitigation of Gradient Staleness
TLDR: This paper defines the Gap as a measure of gradient staleness and proposes Gap-Aware (GA), a novel asynchronous-distributed method that penalizes stale gradients in proportion to the Gap and performs well even when scaling to large numbers of workers.
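As a rough illustration of the idea described above (the exact Gap definition and penalty are given in the GA paper; the formula below is only an assumption-laden sketch), a stale gradient can be dampened elementwise in proportion to how far the master parameters have drifted since the worker read them:

import numpy as np

def gap_aware_update(w_master, w_read, grad_stale, avg_step, lr):
    # Elementwise "gap": parameter drift measured in units of a typical recent step.
    gap = np.abs(w_master - w_read) / (avg_step + 1e-12)
    penalty = np.maximum(gap, 1.0)   # only dampen stale gradients, never amplify them
    return w_master - lr * grad_stale / penalty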
At Stability's Edge: How to Adjust Hyperparameters to Preserve Minima Selection in Asynchronous Training of Neural Networks?
TLDR: This work examines asynchronous training from the perspective of dynamical stability and finds that the degree of delay interacts with the learning rate to change the set of minima accessible to an asynchronous stochastic gradient descent algorithm.
Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism
TLDR: FTPipe is a system that explores a previously unexplored dimension of pipeline model parallelism, making multi-GPU execution of fine-tuning tasks for giant neural networks readily accessible, and achieves up to 3× speedup and state-of-the-art accuracy when fine-tuning giant transformers with billions of parameters.
FedAdapt: Adaptive Offloading for IoT Devices in Federated Learning
TLDR: FedAdapt is an adaptive offloading FL framework that accelerates local training on computationally constrained devices by leveraging layer offloading of deep neural networks (DNNs) to servers; it is demonstrated to reduce training time by up to 40% compared to classic FL, without sacrificing accuracy.
Pipelined Backpropagation at Scale: Training Large Models without Batches
TLDR: This work evaluates the use of small-batch, fine-grained Pipelined Backpropagation, an asynchronous pipeline-parallel training algorithm that has significant hardware advantages, and introduces two methods, Spike Compensation and Linear Weight Prediction, that effectively mitigate the downsides caused by the asynchronicity of Pipelined Backpropagation and outperform existing techniques in this setting.
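As a hedged sketch of one of the two methods named above, linear weight prediction can be approximated (assuming SGD with momentum and a known pipeline delay; this is not necessarily the paper's exact formulation) by extrapolating the weights along the momentum direction:

def predict_weights(w, velocity, lr, delay):
    # Run the forward pass on an estimate of the weights that will exist once the
    # corresponding gradient has traveled back through `delay` pipeline stages.
    return w - lr * velocity * delay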
ShadowSync: Performing Synchronization in the Background for Highly Scalable Distributed Training
TLDR: The ShadowSync framework is proposed, in which the model parameters are synchronized across workers, yet synchronization is isolated from training and run in the background; it achieves the highest example-level parallelism compared to prior art.

References

Showing 1–10 of 71 references
Gap Aware Mitigation of Gradient Staleness
TLDR: This paper defines the Gap as a measure of gradient staleness and proposes Gap-Aware (GA), a novel asynchronous-distributed method that penalizes stale gradients in proportion to the Gap and performs well even when scaling to large numbers of workers.
Asynchrony begets momentum, with an application to deep learning
TLDR: It is shown that running stochastic gradient descent in an asynchronous manner can be viewed as adding a momentum-like term to the SGD iteration, and an important implication is that tuning the momentum parameter is important when considering different levels of asynchrony.
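The momentum-like term can be made explicit. Under a simple staleness model with M asynchronous workers (restated here as an assumption, not quoted from the paper), the expected update obeys a momentum-style recursion whose implicit coefficient grows with M:

\[
  \mathbb{E}\left[w_{t+1} - w_t\right]
  \approx \left(1 - \tfrac{1}{M}\right)\mathbb{E}\left[w_t - w_{t-1}\right]
  - \tfrac{\eta}{M}\,\mathbb{E}\left[\nabla f(w_t)\right]
\]

which suggests the explicit momentum parameter should be turned down as the number of workers grows.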
Slow and Stale Gradients Can Win the Race
TLDR: This work presents a novel theoretical characterization of the speed-up offered by asynchronous SGD methods by analyzing the trade-off between the error in the trained model and the actual training runtime (wall-clock time).
Asynchronous Stochastic Gradient Descent with Delay Compensation
TLDR: The proposed algorithm, DC-ASGD, is evaluated on the CIFAR-10 and ImageNet datasets, and the experimental results demonstrate that it outperforms both synchronous SGD and asynchronous SGD, and nearly approaches the performance of sequential SGD.
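A minimal Python sketch of the delay-compensation idea (the diagonal Hessian surrogate and the value of lam below are illustrative assumptions, not the paper's exact settings):

import numpy as np

def delay_compensated_grad(g_stale, w_current, w_read, lam=0.04):
    # First-order correction: g(w_current) ~= g(w_read) + H @ (w_current - w_read),
    # with the Hessian H replaced by a cheap elementwise surrogate lam * g * g.
    return g_stale + lam * g_stale * g_stale * (w_current - w_read)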
DoubleSqueeze: Parallel Stochastic Gradient Descent with Double-Pass Error-Compensated Compression
TLDR: This work provides a detailed analysis of the double-pass communication model and its asynchronous parallel variant, with error-compensated compression both on the worker nodes and on the parameter server; the method has three very nice properties: it is compatible with an arbitrary compression technique, it admits an improved convergence rate, and it admits linear speedup with respect to the number of workers.
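The error-compensation building block that DoubleSqueeze applies on both the workers and the server can be sketched as follows (top-k is used here only as an example compressor; any compressor fits):

import numpy as np

def compress_topk(x, k):
    # Keep the k largest-magnitude entries and zero out the rest.
    idx = np.argsort(np.abs(x))[-k:]
    out = np.zeros_like(x)
    out[idx] = x[idx]
    return out

def error_compensated(x, error, k):
    corrected = x + error                # add back what was dropped previously
    sent = compress_topk(corrected, k)
    new_error = corrected - sent         # remember what was dropped this time
    return sent, new_error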
Distributed Asynchronous Optimization with Unbounded Delays: How Slow Can You Go?
TLDR: It is shown that it is possible to amortize delays and achieve global convergence with probability 1, even under polynomially growing delays, reaffirming the successful application of DASGD to large-scale optimization problems.
Adaptive Communication Strategies to Achieve the Best Error-Runtime Trade-off in Local-Update SGD
TLDR: The main contribution is the design of AdaComm, an adaptive communication strategy that starts with infrequent averaging to save communication delay and improve convergence speed, and then increases the communication frequency in order to achieve a low error floor.
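A hypothetical sketch of such a schedule for local-update SGD (the decay rule below is purely illustrative; AdaComm derives its own adaptive rule from the error-runtime analysis):

def communication_period(initial_tau, round_idx, decay=0.9):
    # Start with infrequent averaging (large tau) and average more often over time.
    return max(1, int(initial_tau * (decay ** round_idx)))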
Deep Gradient Compression: Reducing the Communication Bandwidth for Distributed Training
TLDR: This paper finds that 99.9% of the gradient exchange in distributed SGD is redundant and proposes Deep Gradient Compression (DGC) to greatly reduce the communication bandwidth, which enables large-scale distributed training on inexpensive commodity 1 Gbps Ethernet and facilitates distributed training on mobile devices.
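The core sparsification step can be sketched in Python as follows (DGC additionally uses momentum correction, gradient clipping, and warm-up, which are omitted here):

import numpy as np

def sparsify(grad, residual, sparsity=0.999):
    # Accumulate small gradient entries locally and transmit only the largest ones.
    acc = residual + grad
    k = max(1, int(acc.size * (1.0 - sparsity)))
    thresh = np.sort(np.abs(acc))[-k]
    mask = np.abs(acc) >= thresh
    to_send = np.where(mask, acc, 0.0)
    new_residual = np.where(mask, 0.0, acc)
    return to_send, new_residual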
Deep learning with Elastic Averaging SGD
TLDR: Experiments demonstrate that the new algorithm accelerates the training of deep architectures compared to DOWNPOUR and other common baseline approaches, and furthermore is very communication efficient.
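The elastic update that gives the method its name can be sketched per worker as follows (single-worker form, with alpha the elastic coefficient; a real deployment applies the center update over all workers):

def easgd_step(x_i, grad_i, center, lr, alpha):
    diff = x_i - center
    x_i_new = x_i - lr * grad_i - alpha * diff   # worker: gradient step + elastic pull
    center_new = center + alpha * diff           # center drifts toward the worker
    return x_i_new, center_new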
PyTorch distributed
TLDR: Evaluations show that, when configured appropriately, the PyTorch distributed data parallel module attains near-linear scalability using 256 GPUs.
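For reference, a minimal (synchronous) DistributedDataParallel setup of the kind the evaluation refers to looks roughly like this; it assumes a launch via torchrun so the process-group environment variables are already set:

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend="nccl")     # one process per GPU
    device = torch.device("cuda", dist.get_rank() % torch.cuda.device_count())
    model = torch.nn.Linear(128, 10).to(device)
    ddp_model = DDP(model, device_ids=[device.index])
    opt = torch.optim.SGD(ddp_model.parameters(), lr=0.1)

    x = torch.randn(32, 128, device=device)
    y = torch.randint(0, 10, (32,), device=device)
    loss = torch.nn.functional.cross_entropy(ddp_model(x), y)
    loss.backward()                             # gradients are all-reduced during backward
    opt.step()

if __name__ == "__main__":
    main()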