Fully Decoupled Neural Network Learning Using Delayed Gradients

Huiping Zhuang, Yi Wang, Qinglai Liu, Zhiping Lin
IEEE Transactions on Neural Networks and Learning Systems
Training neural networks with backpropagation (BP) requires passing activations and gradients sequentially. This sequential dependence gives rise to the forward, backward, and update lockings among modules (each module being a stack of layers), all inherited from BP. In this brief, we propose a fully decoupled training scheme using delayed gradients (FDG) to break all of these lockings. The FDG splits a neural network into multiple modules and trains them independently and… 
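The core idea of a delayed gradient can be illustrated on a toy problem. The sketch below is not the FDG algorithm from the paper, whose details are truncated above. It only shows plain gradient descent in which each update applies a gradient computed a fixed number of steps earlier. The function name `delayed_sgd`, the quadratic objective, and all parameter values are assumptions chosen for illustration.

```python
from collections import deque


def delayed_sgd(grad_fn, w0, lr, delay, steps):
    """Gradient descent where each update uses a gradient that is `delay` steps old.

    Hypothetical helper for illustration only; delay=0 is ordinary gradient descent.
    """
    w = w0
    pending = deque()  # gradients computed but not yet applied, oldest first
    history = [w]
    for _ in range(steps):
        pending.append(grad_fn(w))  # gradient at the current weights
        if len(pending) > delay:
            stale_grad = pending.popleft()  # computed `delay` steps ago
            w = w - lr * stale_grad
        history.append(w)
    return history


def grad(w):
    # Gradient of the toy objective f(w) = (w - 3)^2 (an assumed example).
    return 2.0 * (w - 3.0)


fresh = delayed_sgd(grad, w0=0.0, lr=0.1, delay=0, steps=200)
stale = delayed_sgd(grad, w0=0.0, lr=0.1, delay=2, steps=200)
print(fresh[-1], stale[-1])
```

In this toy setting both runs approach the minimizer at w = 3, but the delayed run makes no progress during its first two steps and then follows an oscillating path, because each update reflects weights that are already out of date.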


Accumulated Decoupled Learning with Gradient Staleness Mitigation for Convolutional Neural Networks
This paper proposes an accumulated decoupled learning (ADL), which includes a module-wise gradient accumulation in order to mitigate the gradient staleness, and quantifies the staleness in such a way that its mitigation can be quantitatively visualized.
Layer-Wise Partitioning and Merging for Efficient and Scalable Deep Learning
A novel layer-wise partitioning and merging framework that parallelizes the forward and backward passes to improve training performance; it outperforms state-of-the-art approaches in training speed and achieves almost linear speedup without compromising the accuracy of the non-parallel approach.
Pipelined Backpropagation at Scale: Training Large Models without Batches
This work evaluates the use of small batch, fine-grained Pipelined Backpropagation, an asynchronous pipeline parallel training algorithm that has significant hardware advantages, and introduces two methods, Spike Compensation and Linear Weight Prediction, that effectively mitigate the downsides caused by the asynchronicity of Pipelined Backpropagation and outperform existing techniques in this setting.
Toward Model Parallelism for Deep Neural Network based on Gradient-free ADMM Framework
This paper proposes a novel parallel deep learning ADMM framework (pdADMM) to achieve layer parallelism: parameters in each layer of neural networks can be updated independently in parallel.
A Hybrid Parallelization Approach for Distributed and Scalable Deep Learning
This work has proposed a generic full end-to-end hybrid parallelization approach combining model and data parallelism for efficiently distributed and scalable training of DNN models and proposed a Genetic Algorithm Based Heuristic Resources Allocation mechanism for optimal distribution of partitions on the available GPUs for computing performance optimization.
Cortico-cerebellar networks as decoupling neural interfaces
This work offers a novel perspective on the cerebellum as a brainwide decoupling machine for efficient credit assignment and opens a new avenue between deep learning and neuroscience.
Towards Quantized Model Parallelism for Graph-Augmented MLPs Based on Gradient-Free ADMM framework
A parallel deep learning Alternating Direction Method of Multipliers (pdADMM) framework to achieve model parallelism: parameters in each layer of GA-MLP models can be updated in parallel.
Distributed Hierarchical Sentence Embeddings for Unsupervised Extractive Text Summarization
A hierarchical BERT model with both word-level and sentence-level training processes is proposed to produce semantically rich sentence embeddings; it outperforms most popular models and achieves a speedup of 2.7 in training time on 4 machines.
Approximate to Be Great: Communication Efficient and Privacy-Preserving Large-Scale Distributed Deep Learning in Internet of Things
A communication efficient and privacy-preserving framework to enable different participants to distributively learn a model with a privacy protection guarantee is designed and a differentially private approximate mechanism for the distributed deep learning is developed.
Accumulated Decoupled Learning: Mitigating Gradient Staleness in Inter-Layer Model Parallelization
An accumulated decoupled learning (ADL) which incorporates the gradient accumulation technique to mitigate the stale gradient effect is proposed and it is proved that the proposed method can converge to critical points, i.e., the gradients converge to 0, in spite of its asynchronous nature.


Decoupled Greedy Learning of CNNs
Decoupled Greedy Learning is considered, based on a greedy relaxation of the joint training objective, recently shown to be effective in the context of Convolutional Neural Networks (CNNs) on large-scale image classification, and it is shown that it can lead to better generalization than sequential greedy optimization.
Training Neural Networks Using Features Replay
This work proposes a novel parallel-objective formulation for the objective function of the neural network, introduces a features replay algorithm, and proves that it is guaranteed to converge to critical points for the non-convex problem under certain conditions.
Decoupled Parallel Backpropagation with Convergence Guarantee
A decoupled parallel backpropagation algorithm for deep learning optimization with a convergence guarantee is proposed, and it is proved that the method converges to critical points for the non-convex problem.
Decoupled Neural Interfaces using Synthetic Gradients
It is demonstrated that in addition to predicting gradients, the same framework can be used to predict inputs, resulting in models which are decoupled in both the forward and backwards pass -- amounting to independent networks which co-learn such that they can be composed into a single functioning corporation.
Asynchronous Stochastic Gradient Descent with Delay Compensation
The proposed algorithm is evaluated on CIFAR-10 and ImageNet datasets, and the experimental results demonstrate that DC-ASGD outperforms both synchronous SGD and asynchronous SGD, and nearly approaches the performance of sequential SGD.
Training Neural Networks with Local Error Signals
It is demonstrated, for the first time, that layer-wise training can approach the state-of-the-art on a variety of image datasets and a completely backprop free variant outperforms previously reported results among methods aiming for higher biological plausibility.
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Applied to a state-of-the-art image classification model, Batch Normalization achieves the same accuracy with 14 times fewer training steps, and beats the original model by a significant margin.
Densely Connected Convolutional Networks
The Dense Convolutional Network (DenseNet) connects each layer to every other layer in a feed-forward fashion. DenseNets have several compelling advantages: they alleviate the vanishing-gradient problem, strengthen feature propagation, encourage feature reuse, and substantially reduce the number of parameters.
Deep Supervised Learning Using Local Errors
The proposed learning mechanism based on fixed, broad, and random tuning of each neuron to the classification categories outperforms the biologically-motivated feedback alignment learning technique on the CIFAR10 dataset, approaching the performance of standard backpropagation.
Assessing the Scalability of Biologically-Motivated Deep Learning Algorithms and Architectures
Results on scaling up biologically motivated models of deep learning on datasets which need deep networks with appropriate architectures to achieve good performance are presented and implementation details help establish baselines for biologically motivated deep learning schemes going forward.