Corpus ID: 59222702

Decoupled Greedy Learning of CNNs

@article{Belilovsky2019DecoupledGL,
  title={Decoupled Greedy Learning of CNNs},
  author={Eugene Belilovsky and Michael Eickenberg and Edouard Oyallon},
  journal={ArXiv},
  year={2019},
  volume={abs/1901.08164}
}
A commonly cited inefficiency of neural network training by back-propagation is the update locking problem: each layer must wait for the signal to propagate through the full network before updating. Several alternatives that can alleviate this issue have been proposed. In this context, we consider a simpler, but more effective, substitute that uses minimal feedback, which we call Decoupled Greedy Learning (DGL). It is based on a greedy relaxation of the joint training objective, recently shown… 
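To make the greedy relaxation concrete, the sketch below shows per-module training with auxiliary classifiers in PyTorch: each module updates from its own local loss and passes only detached activations downstream, so no module waits for a global backward pass. The module widths, auxiliary heads, and optimizer settings are illustrative assumptions, not the architecture used in the paper.

```python
# Minimal sketch of decoupled greedy layer-wise training (illustrative, not the
# paper's exact architecture or hyper-parameters). Each module has its own
# auxiliary classifier and optimizer; the input to module k+1 is detached, so no
# gradient crosses module boundaries and each module updates as soon as its own
# forward pass finishes.
import torch
import torch.nn as nn

n_modules, n_classes = 3, 10

modules = nn.ModuleList([
    nn.Sequential(nn.Conv2d(3 if k == 0 else 32, 32, 3, padding=1), nn.ReLU())
    for k in range(n_modules)
])
aux_heads = nn.ModuleList([
    nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, n_classes))
    for _ in range(n_modules)
])
opts = [torch.optim.SGD(list(m.parameters()) + list(h.parameters()), lr=0.1)
        for m, h in zip(modules, aux_heads)]
criterion = nn.CrossEntropyLoss()

def train_step(x, y):
    h = x
    for k in range(n_modules):
        h = modules[k](h)                     # local forward
        loss = criterion(aux_heads[k](h), y)  # greedy auxiliary objective
        opts[k].zero_grad()
        loss.backward()                       # gradients stay inside module k
        opts[k].step()
        h = h.detach()                        # block the global backward path

x = torch.randn(8, 3, 32, 32)
y = torch.randint(0, 10, (8,))
train_step(x, y)
```

In this single-process form the modules still run one after another, but because every update depends only on local information, the modules can in principle be placed on separate workers, which is what DGL exploits.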

Citations

Decoupled Greedy Learning of CNNs for Synchronous and Asynchronous Distributed Learning

This work considers an optimization of this objective that permits the layer training to be decoupled, allowing layers or modules in networks to be trained with potentially linear parallelization, and proposes an approach based on online vector quantization to address bandwidth and memory issues.
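A rough sketch of that decoupling across workers: a bounded buffer sits between two modules, and the downstream module trains on whatever (possibly stale) activations its predecessor has already produced. The threading layout, buffer size, and models are assumptions for illustration, not the paper's exact replay or quantization scheme.

```python
# Sketch of asynchronous decoupled training with a bounded buffer between two
# modules (illustrative; buffer size, threading layout and models are assumptions).
# Neither side blocks on the other beyond the buffer capacity.
import queue, threading
import torch
import torch.nn as nn

buf = queue.Queue(maxsize=4)                       # bounded buffer between modules
mod1 = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
head1 = nn.Linear(64, 10)
mod2 = nn.Sequential(nn.Linear(64, 64), nn.ReLU())
head2 = nn.Linear(64, 10)
opt1 = torch.optim.SGD(list(mod1.parameters()) + list(head1.parameters()), lr=0.1)
opt2 = torch.optim.SGD(list(mod2.parameters()) + list(head2.parameters()), lr=0.1)
ce = nn.CrossEntropyLoss()

def producer(n_steps=20):
    for _ in range(n_steps):
        x = torch.randn(16, 32)
        y = torch.randint(0, 10, (16,))
        h = mod1(x)
        loss = ce(head1(h), y)                     # local auxiliary loss for module 1
        opt1.zero_grad(); loss.backward(); opt1.step()
        buf.put((h.detach(), y))                   # hand detached activations downstream
    buf.put(None)                                  # sentinel: end of stream

def consumer():
    while (item := buf.get()) is not None:
        h, y = item
        loss = ce(head2(mod2(h)), y)               # module 2 trains independently
        opt2.zero_grad(); loss.backward(); opt2.step()

t1, t2 = threading.Thread(target=producer), threading.Thread(target=consumer)
t1.start(); t2.start(); t1.join(); t2.join()
```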

Reducing the Computational Burden of Deep Learning with Recursive Local Representation Alignment

Experiments with deep residual networks on CIFAR-10 and the massive-scale benchmark, ImageNet, show that the proposed gradient-free learning procedure generalizes as well as backprop while converging sooner due to weight updates that are parallelizable and computationally less demanding.

Scaling Forward Gradient With Local Losses

This paper shows that it is possible to substantially reduce the variance of the forward gradient estimator by applying perturbations to activations rather than weights, and improves the scalability of forward gradient by introducing a large number of local greedy loss functions and a new MLPMixer-inspired architecture, LocalMixer, that is more suitable for local learning.
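The core estimator is simple to state: sample a random tangent on the activations, take the directional derivative of a local loss along that tangent with forward-mode AD, and scale the tangent by the resulting scalar. The sketch below illustrates this for a single hypothetical block with a plain squared-error local loss; it assumes torch.func.jvp from PyTorch 2.x, and the block, loss, and shapes are illustrative choices rather than the paper's LocalMixer setup.

```python
# Sketch of an activity-perturbation forward-gradient estimator (illustrative).
# No backward pass: forward-mode AD gives the directional derivative of a local
# loss along a random tangent u on the activations, and (dL/da . u) * u is an
# unbiased estimate of dL/da; the weight gradient follows from the local chain rule.
import torch
import torch.nn.functional as F
from torch.func import jvp

torch.manual_seed(0)
x = torch.randn(16, 32)                  # inputs reaching this local block
y = torch.randint(0, 10, (16,))
y_onehot = F.one_hot(y, num_classes=10).float()
W = torch.randn(32, 10) * 0.1            # local readout weights (hypothetical block)

def local_loss(a):
    # simple local squared-error loss against one-hot targets (illustrative choice)
    return 0.5 * ((a - y_onehot) ** 2).mean()

a = x @ W                                # activations of the local block
u = torch.randn_like(a)                  # random tangent applied to the activations
_, dir_deriv = jvp(local_loss, (a,), (u,))   # forward-mode directional derivative
grad_a_est = dir_deriv * u               # unbiased estimate of dL/da (E[u u^T] = I)
grad_W_est = x.t() @ grad_a_est          # local chain rule back to the block's weights
W -= 0.1 * grad_W_est                    # plain SGD step with the estimated gradient
```

The estimate is unbiased because E[u u^T] = I, and perturbing activations rather than weights keeps the perturbed space lower-dimensional, which is the variance reduction the paper highlights; in practice the estimate is averaged over batches and combined with many local losses.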

Local Learning with Neuron Groups

This work proposes to study how local learning can be applied at the level of splitting layers or modules into sub-components, adding a notion of width-wise modularity to the existing depth-wise modularity associated with local learning.

Revisiting Locally Supervised Learning: an Alternative to End-to-end Training

An information propagation (InfoPro) loss is proposed, which encourages local modules to preserve as much useful information as possible while progressively discarding task-irrelevant information, and is capable of achieving competitive performance with less than 40% of the memory footprint of end-to-end (E2E) training.

ALLOW FOR FEEDFORWARD TRAINING OF DEEP NEURAL NETWORKS

This work shows that the one-hot-encoded labels provided in supervised classification problems, denoted as targets, can be viewed as a proxy for the error sign, enabling layer-wise feedforward training of the hidden layers and thus solving the weight-transport and update-locking problems while relaxing computational and memory requirements.

Accumulated Decoupled Learning with Gradient Staleness Mitigation for Convolutional Neural Networks

This paper proposes accumulated decoupled learning (ADL), which uses module-wise gradient accumulation to mitigate gradient staleness and quantifies the staleness so that its mitigation can be visualized quantitatively.

Error-driven Input Modulation: Solving the Credit Assignment Problem without a Backward Pass

This work proposes to replace the backward pass with a second forward pass in which the input signal is modulated based on the error of the network, and shows that this novel learning rule comprehensively addresses all the above-mentioned issues and can be applied to both fully connected and convolutional models.

Diversely Stale Parameters for Efficient Training of CNNs

This paper proposes Layer-wise Staleness and a novel efficient training algorithm, Diversely Stale Parameters (DSP), which addresses these challenges without loss of accuracy or memory issues; extensive experiments on training deep convolutional neural networks demonstrate that the proposed DSP algorithm achieves significant training speedup.

Decoupled Greedy Learning of Graph Neural Networks

A decoupled greedy learning method for GNNs (DGL-GNN) that decouples the GNN into smaller modules and associates each module with greedy auxiliary objectives is introduced, allowing GNN layers to be updated during the training process without waiting for feedback from successor layers, thus making parallel GNN training possible.
...

References

Showing 1-10 of 42 references

Deep Cascade Learning

The features learned by the deep cascade learning algorithm are investigated, and it is found that better, domain-specific representations are learned in early layers compared to end-to-end training.

Greedy Layerwise Learning Can Scale to ImageNet

This work uses 1-hidden-layer learning problems to sequentially build deep networks layer by layer, so that the deep networks can inherit properties from shallow ones; it obtains an 11-layer network that exceeds several members of the VGG model family on ImageNet and can train a VGG-11 model to the same accuracy as end-to-end learning.
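For contrast with the simultaneous decoupled updates sketched earlier, here is a minimal sketch of the sequential greedy procedure: train one stage at a time against a shallow auxiliary classifier on top of the frozen, previously trained stages, then freeze it and grow the network. Stage widths, the auxiliary head, and the inner loop are illustrative assumptions, not the paper's ImageNet configuration.

```python
# Sketch of sequential greedy layer-wise training (illustrative sizes and models).
# Each stage solves a shallow auxiliary problem on top of frozen earlier stages.
import torch
import torch.nn as nn

def make_stage(in_ch):
    return nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU())

def make_aux_head():
    # 1-hidden-layer auxiliary classifier on top of the current stage
    return nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))

ce = nn.CrossEntropyLoss()
trained_stages = nn.ModuleList()

def features(x):
    with torch.no_grad():                 # earlier stages are frozen
        for stage in trained_stages:
            x = stage(x)
    return x

for k in range(3):                        # grow the network one stage at a time
    stage, head = make_stage(3 if k == 0 else 32), make_aux_head()
    opt = torch.optim.SGD(list(stage.parameters()) + list(head.parameters()), lr=0.1)
    for _ in range(10):                   # inner loop: train only the new stage
        x = torch.randn(8, 3, 32, 32)
        y = torch.randint(0, 10, (8,))
        loss = ce(head(stage(features(x))), y)
        opt.zero_grad(); loss.backward(); opt.step()
    trained_stages.append(stage)          # freeze and keep the new stage
```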

Training Neural Networks Using Features Replay

This work proposes a novel parallel-objective formulation for the objective function of the neural network and introduces the features replay algorithm, proving that it is guaranteed to converge to critical points of the non-convex problem under certain conditions.

Greedy Layer-Wise Training of Deep Networks

These experiments confirm the hypothesis that the greedy layer-wise unsupervised training strategy mostly helps the optimization, by initializing weights in a region near a good local minimum, giving rise to internal distributed representations that are high-level abstractions of the input, bringing better generalization.

Understanding Synthetic Gradients and Decoupled Neural Interfaces

This paper studies DNIs through the use of synthetic gradients on feed-forward networks to better understand their behaviour and elucidate their effect on optimisation; it shows that incorporating SGs does not affect the representational strength of the learning system for a neural network, and proves the convergence of the learning system for linear and deep linear models.

Training Neural Networks with Local Error Signals

It is demonstrated, for the first time, that layer-wise training can approach the state of the art on a variety of image datasets, and that a completely backprop-free variant outperforms previously reported results among methods aiming for higher biological plausibility.

Training Neural Networks Without Gradients: A Scalable ADMM Approach

This paper explores an unconventional training method that uses alternating direction methods and Bregman iteration to train networks without gradient descent steps, and exhibits strong scaling in the distributed setting, yielding linear speedups even when split over thousands of cores.

Decoupled Neural Interfaces using Synthetic Gradients

It is demonstrated that in addition to predicting gradients, the same framework can be used to predict inputs, resulting in models which are decoupled in both the forward and backwards pass -- amounting to independent networks which co-learn such that they can be composed into a single functioning corporation.
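A minimal sketch of one synthetic-gradient cut point: the lower module updates immediately from a gradient predicted by a small synthesizer, and the synthesizer is later regressed onto the true gradient once the upper module has computed it. The synthesizer architecture, optimizers, and the omission of label conditioning are assumptions for illustration, not the exact DNI setup.

```python
# Sketch of a synthetic-gradient (DNI-style) cut point between two modules
# (illustrative). The lower module updates from a *predicted* gradient, removing
# update locking; the synthesizer is then fit to the true gradient at the cut.
import torch
import torch.nn as nn

lower = nn.Sequential(nn.Linear(32, 64), nn.ReLU())
upper = nn.Sequential(nn.Linear(64, 10))
synth = nn.Linear(64, 64)                      # predicts dL/dh from h
opt_lower = torch.optim.SGD(lower.parameters(), lr=0.1)
opt_upper = torch.optim.SGD(upper.parameters(), lr=0.1)
opt_synth = torch.optim.SGD(synth.parameters(), lr=0.01)
ce = nn.CrossEntropyLoss()

x = torch.randn(16, 32)
y = torch.randint(0, 10, (16,))

# 1) Lower module updates right away using the synthetic gradient.
h = lower(x)
g_hat = synth(h.detach())                      # predicted dL/dh
opt_lower.zero_grad()
h.backward(g_hat.detach())                     # inject the predicted gradient at the cut
opt_lower.step()

# 2) Upper module computes the true loss and the true gradient at the cut point.
h_in = h.detach().requires_grad_(True)
loss = ce(upper(h_in), y)
opt_upper.zero_grad()
loss.backward()
opt_upper.step()

# 3) Synthesizer is regressed onto the (now available) true gradient.
synth_loss = ((synth(h.detach()) - h_in.grad.detach()) ** 2).mean()
opt_synth.zero_grad()
synth_loss.backward()
opt_synth.step()
```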

Building a Regular Decision Boundary with Deep Networks

  • Edouard Oyallon
  • Computer Science
    2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
  • 2017
This work builds a generic architecture of Convolutional Neural Networks to discover empirical properties of neural networks and shows that the nonlinearity of a deep network does not need to be continuous, non-expansive, or point-wise to achieve good performance.

Decoupled Parallel Backpropagation with Convergence Guarantee

A decoupled parallel backpropagation algorithm for deep learning optimization with a convergence guarantee is proposed, and it is proved that the method converges to critical points of the non-convex problem.