Corpus ID: 227343996

Parallel Training of Deep Networks with Local Updates

@article{Laskin2020ParallelTO,
  title={Parallel Training of Deep Networks with Local Updates},
  author={Michael Laskin and Luke Metz and Seth Nabarro and Mark Saroufim and Badreddine Noune and Carlo Luschi and Jascha Sohl-Dickstein and Pieter Abbeel},
  journal={ArXiv},
  year={2020},
  volume={abs/2012.03837}
}
Deep learning models trained on large data sets have been widely successful in both vision and language domains. As state-of-the-art deep learning architectures have continued to grow in parameter count, so have the compute budgets and times required to train them, increasing the need for compute-efficient methods that parallelize training. Two common approaches to parallelizing the training of deep networks have been data and model parallelism. While useful, data and model parallelism suffer from…
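To make the contrast with local updates concrete, the sketch below is a minimal, hypothetical PyTorch example (not the paper's exact architecture or hyperparameters): each block of a network is trained against its own auxiliary loss on detached inputs, so no gradients cross block boundaries and the per-block updates could in principle run on separate devices in parallel.

# Minimal sketch of local updates (illustrative, not the paper's exact method):
# each block gets its own auxiliary classifier and optimizer, and receives a
# *detached* input, so no gradients flow between blocks and the per-block
# updates could in principle run on separate devices in parallel.
import torch
import torch.nn as nn

blocks = nn.ModuleList([
    nn.Sequential(nn.Linear(784, 256), nn.ReLU()),
    nn.Sequential(nn.Linear(256, 256), nn.ReLU()),
    nn.Sequential(nn.Linear(256, 256), nn.ReLU()),
])
aux_heads = nn.ModuleList([nn.Linear(256, 10) for _ in blocks])  # local losses
opts = [torch.optim.SGD(list(b.parameters()) + list(h.parameters()), lr=0.1)
        for b, h in zip(blocks, aux_heads)]
criterion = nn.CrossEntropyLoss()

def local_update_step(x, y):
    """One training step: every block is updated from its own local loss."""
    h = x
    for block, head, opt in zip(blocks, aux_heads, opts):
        h = block(h.detach())          # stop gradients at the block boundary
        loss = criterion(head(h), y)   # purely local objective
        opt.zero_grad()
        loss.backward()                # backprop stays inside the block
        opt.step()
    return loss.item()

x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
print(local_update_step(x, y))

Because each block's backward pass stays inside the block, the backward locking of end-to-end backpropagation is removed, which is the kind of decoupling that local-update methods exploit for parallelism.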
Citations

AdaSplit: Adaptive Trade-offs for Resource-constrained Distributed Deep Learning
  • 2021
Distributed deep learning frameworks like Federated Learning (FL) and its variants are enabling personalized experiences across a wide range of web clients and mobile/IoT devices. However, these …
Decoupled Greedy Learning of CNNs for Synchronous and Asynchronous Distributed Learning
This work considers an optimization of the greedy layerwise objective that permits decoupling of layer training, allowing layers or modules in networks to be trained with a potentially linear parallelization, and proposes an approach based on online vector quantization to address bandwidth and memory issues.
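The sketch below illustrates, in a hypothetical PyTorch form, how activations crossing a module boundary can be vector-quantized so that only codebook indices are transmitted; the codebook size, update rule, and shapes are illustrative assumptions, not this paper's exact algorithm.

# Hedged sketch of quantizing activations before they cross a module boundary,
# in the spirit of the bandwidth-reduction idea above (codebook size, update
# rule and shapes are assumptions, not the paper's algorithm).
import torch

def vq_encode(acts, codebook):
    """Map each activation vector to the index of its nearest codebook entry."""
    # acts: (batch, dim); codebook: (K, dim)
    dists = torch.cdist(acts, codebook)          # (batch, K) pairwise distances
    return dists.argmin(dim=1)                   # only these indices are transmitted

def vq_decode(indices, codebook):
    return codebook[indices]                     # receiver reconstructs activations

def vq_update(acts, indices, codebook, lr=0.1):
    """Online codebook update: move used entries toward the activations they coded."""
    for k in indices.unique():
        mask = indices == k
        codebook[k] += lr * (acts[mask].mean(dim=0) - codebook[k])
    return codebook

acts = torch.randn(32, 128)                      # activations leaving module k
codebook = torch.randn(64, 128)                  # K = 64 shared codewords
idx = vq_encode(acts, codebook)                  # 32 integers instead of 32x128 floats
recon = vq_decode(idx, codebook)                 # what module k+1 would receive
codebook = vq_update(acts, idx, codebook)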
NoPeek-Infer: Preventing face reconstruction attacks in distributed inference after on-premise training
For models trained on-premise but deployed in a distributed fashion across multiple entities, we demonstrate that minimizing distance correlation between sensitive data such as faces and intermediary …
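As a rough illustration of the quantity being minimized, the hypothetical sketch below computes a squared sample distance correlation between a batch of sensitive inputs and their intermediate representations; the encoder, batch-level estimator, and loss weighting are assumptions, not this paper's protocol.

# Hedged sketch of a distance-correlation penalty between raw inputs and an
# intermediate representation, the kind of leakage measure minimized above
# (the encoder and the batch-level estimator details are assumptions).
import torch

def pairwise_dist(x):
    return torch.cdist(x, x)                       # (n, n) Euclidean distances

def double_center(d):
    return d - d.mean(0, keepdim=True) - d.mean(1, keepdim=True) + d.mean()

def distance_correlation_sq(x, z):
    """Sample (squared) distance correlation between two batches of vectors."""
    a, b = double_center(pairwise_dist(x)), double_center(pairwise_dist(z))
    dcov2 = (a * b).mean()
    dvar_x, dvar_z = (a * a).mean(), (b * b).mean()
    return dcov2 / (dvar_x * dvar_z).sqrt().clamp_min(1e-12)

x = torch.randn(64, 784)                           # sensitive inputs (e.g. flattened faces)
encoder = torch.nn.Linear(784, 32)                 # stand-in for the on-premise model prefix
z = encoder(x)
leakage = distance_correlation_sq(x, z)            # would be added to the task loss with a weight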

References

Showing 1-10 of 86 references
Dissecting the Graphcore IPU Architecture via Microbenchmarking
This report focuses on the architecture and performance of the Intelligence Processing Unit (IPU), a novel, massively parallel platform recently introduced by Graphcore and aimed at Artificial Intelligence …
An Empirical Model of Large-Batch Training
It is demonstrated that a simple and easy-to-measure statistic called the gradient noise scale predicts the largest useful batch size across many domains and applications, including a number of supervised learning datasets, reinforcement learning domains, and even generative model training (autoencoders on SVHN).
PipeDream: Fast and Efficient Pipeline Parallel DNN Training
Experiments with five different DNNs on two different clusters show that PipeDream is up to 5x faster in time-to-accuracy compared to data-parallel training.
Revisiting Distributed Synchronous SGD
It is demonstrated that a third approach, synchronous optimization with backup workers, can avoid asynchronous noise while mitigating the effect of the worst stragglers, and it is empirically shown to converge faster and to better test accuracies.
Decoupled Greedy Learning of CNNs
Decoupled Greedy Learning is considered, based on a greedy relaxation of the joint training objective and recently shown to be effective in the context of Convolutional Neural Networks (CNNs) on large-scale image classification; it is shown that this approach can lead to better generalization than sequential greedy optimization.
LoCo: Local Contrastive Representation Learning
By overlapping local blocks stacked on top of each other, this work effectively increases the decoder depth and allows upper blocks to implicitly send feedback to lower blocks, closing the performance gap between local learning and end-to-end contrastive learning algorithms for the first time.
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
GPipe is introduced, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers by pipelining different sub-sequences of layers on separate accelerators, resulting in almost linear speedup when a model is partitioned across multiple accelerators.
Greedy Layerwise Learning Can Scale to ImageNet
This work uses 1-hidden-layer learning problems to sequentially build deep networks layer by layer, which can inherit properties from shallow networks. The method obtains an 11-layer network that exceeds several members of the VGG model family on ImageNet and can train a VGG-11 model to the same accuracy as end-to-end learning.
Putting An End to End-to-End: Gradient-Isolated Learning of Representations
A novel deep learning method for local self-supervised representation learning is proposed that requires neither labels nor end-to-end backpropagation but instead exploits the natural order in data, allowing large-scale distributed training of very deep neural networks on unlabelled datasets.
Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis.
The problem of parallelism in DNN training is described from a theoretical perspective, followed by approaches for its parallelization, and potential directions for parallelism in deep learning are extrapolated.