Corpus ID: 227343996

Parallel Training of Deep Networks with Local Updates

Michael Laskin, Luke Metz, Seth Nabarro, Mark Saroufim, Badreddine Noune, Carlo Luschi, Jascha Sohl-Dickstein, Pieter Abbeel
Deep learning models trained on large data sets have been widely successful in both vision and language domains. As state-of-the-art deep learning architectures have continued to grow in parameter count, so have the compute budgets and times required to train them, increasing the need for compute-efficient methods that parallelize training. Two common approaches to parallelizing the training of deep networks have been data and model parallelism. While useful, data and model parallelism suffer from…
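To make the distinction concrete, here is a minimal NumPy sketch of data parallelism (an illustrative toy, not the paper's code): sharding a batch across workers and averaging their gradients reproduces the full-batch gradient.

```python
import numpy as np

# Illustrative toy of data parallelism (not the paper's code): each "worker"
# computes the gradient of a least-squares loss on its shard of the batch;
# averaging the shard gradients over equal-sized shards reproduces the
# full-batch gradient exactly.

def grad(w, X, y):
    # Gradient of 0.5 * ||Xw - y||^2 / n with respect to w.
    return X.T @ (X @ w - y) / len(y)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
y = rng.normal(size=8)
w = np.zeros(3)

shards = np.array_split(np.arange(8), 4)              # 4 workers, 2 rows each
worker_grads = [grad(w, X[i], y[i]) for i in shards]  # computed in parallel
g_parallel = np.mean(worker_grads, axis=0)

g_full = grad(w, X, y)                                # single-device reference
print(np.allclose(g_parallel, g_full))  # True
```

Model parallelism, by contrast, splits the parameters themselves across devices; the synchronized averaging step above is part of what local-update methods aim to relax.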
Accelerating Federated Learning with Split Learning on Locally Generated Losses
Federated learning (FL) operates through model exchanges between the server and the clients, and suffers from significant communication as well as client-side computation burdens. While emerging
AdaSplit: Adaptive Trade-offs for Resource-constrained Distributed Deep Learning
AdaSplit is introduced, which enables efficient scaling of split learning (SL) to low-resource scenarios by reducing bandwidth consumption and improving performance across heterogeneous clients; C3-Score, a metric to evaluate performance under resource budgets, is also introduced.
Decoupled Greedy Learning of CNNs for Synchronous and Asynchronous Distributed Learning
This work considers an optimization of this objective that decouples layer training, allowing layers or modules of a network to be trained in parallel with potentially linear speedup, and proposes an approach based on online vector quantization to address bandwidth and memory issues.
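The decoupling idea can be sketched as follows (a hypothetical NumPy simplification, not the authors' implementation): each layer minimizes its own local loss through an auxiliary head, and no gradient crosses a layer boundary.

```python
import numpy as np

# Hypothetical simplification of decoupled greedy learning (not the authors'
# code): each layer trains against its own local MSE loss, using only the
# activations of the layer below, which are treated as detached constants --
# no gradient crosses a layer boundary, so the two updates could run on
# separate devices in parallel.

rng = np.random.default_rng(1)
X = rng.normal(size=(16, 4))
Y = rng.normal(size=(16, 2))

W1 = 0.1 * rng.normal(size=(4, 4))   # layer 1 weights
H1 = 0.1 * rng.normal(size=(4, 2))   # layer 1's local auxiliary head
W2 = 0.1 * rng.normal(size=(4, 2))   # layer 2 weights (acts as its own head)

def local_losses():
    A1 = np.maximum(X @ W1, 0.0)
    return np.mean((A1 @ H1 - Y) ** 2), np.mean((A1 @ W2 - Y) ** 2)

l1_start, l2_start = local_losses()
lr = 0.05
for _ in range(300):
    A1 = np.maximum(X @ W1, 0.0)             # ReLU features, "detached" here

    # Layer 1: gradient flows only through its auxiliary head H1.
    dP1 = 2.0 * (A1 @ H1 - Y) / len(Y)
    dA1 = dP1 @ H1.T
    H1 -= lr * A1.T @ dP1
    W1 -= lr * X.T @ (dA1 * (A1 > 0))

    # Layer 2: its update never touches W1.
    dP2 = 2.0 * (A1 @ W2 - Y) / len(Y)
    W2 -= lr * A1.T @ dP2

l1_end, l2_end = local_losses()
print(l1_end < l1_start and l2_end < l2_start)  # each local loss decreased
```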
NoPeek-Infer: Preventing face reconstruction attacks in distributed inference after on-premise training
For models trained on-premise but deployed in a distributed fashion across multiple entities, we demonstrate that minimizing distance correlation between sensitive data such as faces and intermediary
Training Spiking Neural Networks Using Lessons From Deep Learning
The delicate interplay between encoding data as spikes and the learning process; the challenges and solutions of applying gradient-based learning to spiking neural networks; the subtle link between temporal backpropagation and spike timing dependent plasticity; and how deep learning might move towards biologically plausible online learning are explored.
Dissecting the Graphcore IPU Architecture via Microbenchmarking
This report focuses on the architecture and performance of the Intelligence Processing Unit (IPU), a novel, massively parallel platform recently introduced by Graphcore and aimed at Artificial
An Empirical Model of Large-Batch Training
It is demonstrated that a simple and easy-to-measure statistic called the gradient noise scale predicts the largest useful batch size across many domains and applications, including a number of supervised learning datasets, reinforcement learning domains, and even generative model training (autoencoders on SVHN).
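The estimator summarized above can be reproduced on synthetic gradients (a sketch in our own notation; the dimensions, variances, and batch sizes are arbitrary illustrative choices): measuring gradient norms at two batch sizes separates the true-gradient and noise contributions.

```python
import numpy as np

# Sketch of the two-batch-size noise-scale estimator (our notation): the
# "simple" noise scale is B_noise = tr(Sigma) / |G|^2, where G is the true
# gradient and Sigma the per-example gradient covariance. Since
# E[|G_B|^2] = |G|^2 + tr(Sigma) / B, norms at two batch sizes suffice.

rng = np.random.default_rng(0)
d = 1000
G = np.full(d, 0.1)          # true gradient, |G|^2 = 10
sigma2 = 1.0                 # per-coordinate noise variance, tr(Sigma) = 1000
# Ground truth here: B_noise = 1000 / 10 = 100.

def batch_grad_sqnorm(B, trials=2000):
    # Average squared norm of a size-B minibatch gradient.
    g = G + rng.normal(scale=np.sqrt(sigma2 / B), size=(trials, d))
    return np.mean(np.sum(g * g, axis=1))

B_small, B_big = 32, 1024
ns, nb = batch_grad_sqnorm(B_small), batch_grad_sqnorm(B_big)

G2_est = (B_big * nb - B_small * ns) / (B_big - B_small)   # estimates |G|^2
S_est = (ns - nb) / (1.0 / B_small - 1.0 / B_big)          # estimates tr(Sigma)
print(round(S_est / G2_est))  # estimated B_noise, close to the true 100
```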
PipeDream: Fast and Efficient Pipeline Parallel DNN Training
Experiments with five different DNNs on two different clusters show that PipeDream is up to 5x faster in time-to-accuracy compared to data-parallel training.
Revisiting Distributed Synchronous SGD
It is demonstrated that a third approach, synchronous optimization with backup workers, can avoid asynchronous noise while mitigating the impact of the worst stragglers; the approach is empirically validated and shown to converge faster and to better test accuracies.
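The backup-worker idea is easy to see in a toy simulation (our own sketch with a made-up timing distribution, not the paper's experiment): launching N + b workers but waiting only for the first N gradients removes the heavy tail of the per-step time.

```python
import numpy as np

# Toy simulation of synchronous SGD with backup workers (made-up timing
# model, not the paper's experiment): each step launches N + b workers, but
# the parameter update aggregates only the first N gradients to arrive, so
# the slowest b stragglers never stall the step.

rng = np.random.default_rng(0)
N, b, steps = 8, 2, 1000

# Per-worker completion times: 1s of compute plus an exponential straggler tail.
times = 1.0 + rng.exponential(1.0, size=(steps, N + b))

# Fully synchronous with N workers: wait for the slowest of all N.
fully_sync = np.max(times[:, :N], axis=1)

# Backup workers: wait for the N-th fastest of the N + b launched.
with_backup = np.sort(times, axis=1)[:, N - 1]

print(np.mean(with_backup) < np.mean(fully_sync))  # True: tail latency is cut
```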
Decoupled Greedy Learning of CNNs
Decoupled Greedy Learning is considered, based on a greedy relaxation of the joint training objective, recently shown to be effective in the context of Convolutional Neural Networks (CNNs) on large-scale image classification, and it is shown that it can lead to better generalization than sequential greedy optimization.
LoCo: Local Contrastive Representation Learning
By overlapping local blocks stacked on top of each other, this work effectively increases the decoder depth and allows upper blocks to implicitly send feedback to lower blocks, closing the performance gap between local learning and end-to-end contrastive learning algorithms for the first time.
GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism
GPipe is introduced, a pipeline parallelism library that allows scaling any network that can be expressed as a sequence of layers by pipelining different sub-sequences of layers on separate accelerators, resulting in almost linear speedup when a model is partitioned across multiple accelerators.
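The source of the pipelining speedup can be shown with a scheduling toy (our own illustration, not GPipe's code): with S stages and M microbatches, microbatch m can run on stage s at tick m + s, so a forward pass finishes in M + S - 1 ticks rather than the M * S ticks of unpipelined microbatch-at-a-time execution.

```python
# Scheduling toy for pipeline parallelism (an illustration, not GPipe's
# implementation): microbatch m occupies stage s at tick m + s, assuming
# every stage takes one tick per microbatch.

def pipeline_schedule(num_stages, num_microbatches):
    schedule = {}  # tick -> list of (stage, microbatch) pairs running at once
    for m in range(num_microbatches):
        for s in range(num_stages):
            schedule.setdefault(m + s, []).append((s, m))
    return schedule

S, M = 4, 8
schedule = pipeline_schedule(S, M)
print(len(schedule))     # 11 ticks = M + S - 1, versus M * S = 32 unpipelined
print(len(schedule[3]))  # 4: in steady state all stages work concurrently
```

As M grows relative to S, the M + S - 1 total approaches M, which is the near-linear speedup the summary above describes.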
Greedy Layerwise Learning Can Scale to ImageNet
This work uses 1-hidden-layer learning problems to sequentially build deep networks layer by layer, which can inherit properties from shallow networks; it obtains an 11-layer network that exceeds several members of the VGG model family on ImageNet and can train a VGG-11 model to the same accuracy as end-to-end learning.
Putting An End to End-to-End: Gradient-Isolated Learning of Representations
A novel deep learning method for local self-supervised representation learning is proposed that requires neither labels nor end-to-end backpropagation but instead exploits the natural order in data, allowing large-scale distributed training of very deep neural networks on unlabelled datasets.
Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis.
The problem of parallelizing DNN training is described from a theoretical perspective, followed by concrete approaches to parallelization, and potential directions for parallelism in deep learning are extrapolated.