Deep Learning without Shortcuts: Shaping the Kernel with Tailored Rectifiers

@article{Zhang2022DeepLW,
  title={Deep Learning without Shortcuts: Shaping the Kernel with Tailored Rectifiers},
  author={Guodong Zhang and Aleksandar Botev and James Martens},
  journal={ArXiv},
  year={2022},
  volume={abs/2203.08120}
}
Training very deep neural networks is still an extremely challenging task. The common solution is to use shortcut connections and normalization layers, which are both crucial ingredients in the popular ResNet architecture. However, there is strong evidence to suggest that ResNets behave more like ensembles of shallower networks than truly deep ones. Recently, it was shown that deep vanilla networks (i.e. networks without normalization layers or shortcut connections) can be trained as fast as… 
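
The "tailored rectifiers" of the abstract are leaky-ReLU-style activations whose negative slope is tuned so that the network's kernel remains well-behaved at large depth. The NumPy sketch below is an illustration of that idea, not the paper's implementation: tailored_relu, cosine_map_depth, and the Monte-Carlo propagation loop are assumptions chosen only to show how the negative slope controls the way input similarity evolves with depth.

# Hedged sketch: a rectifier with a tunable negative slope, and a toy
# estimate of how the cosine similarity ("kernel") between two inputs
# evolves across many layers of that rectifier.
import numpy as np

def tailored_relu(x, neg_slope):
    """Leaky ReLU with a tunable negative slope."""
    return np.where(x >= 0, x, neg_slope * x)

def cosine_map_depth(neg_slope, depth=50, c0=0.5, n_samples=200_000):
    """Monte-Carlo estimate of the cosine similarity between two inputs
    after propagating through `depth` layers of the rectifier."""
    c = c0
    for _ in range(depth):
        # Sample correlated standard Gaussian pre-activations.
        u = np.random.randn(n_samples)
        v = c * u + np.sqrt(max(1.0 - c**2, 0.0)) * np.random.randn(n_samples)
        fu = tailored_relu(u, neg_slope)
        fv = tailored_relu(v, neg_slope)
        c = np.mean(fu * fv) / np.sqrt(np.mean(fu**2) * np.mean(fv**2))
    return c

print(cosine_map_depth(0.0))   # plain ReLU: similarities collapse toward 1
print(cosine_map_depth(0.9))   # slope near 1: input geometry is better preserved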

AutoInit: Automatic Initialization via Jacobian Tuning

TLDR
A new and inexpensive algorithm is introduced that automatically produces a good initialization for general feed-forward DNNs by using the Jacobian between adjacent network blocks to tune the network hyperparameters to criticality.
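
The sketch below illustrates the general flavor of Jacobian-based tuning and is not the paper's algorithm: a toy fully-connected ReLU block's average squared Jacobian gain is estimated with finite differences at initialization, and the block's weights are rescaled so that the gain is roughly one ("criticality"). The block form, estimate_jacobian_gain, and the rescaling rule are assumptions made for illustration.

# Hedged sketch: estimate each block's average squared Jacobian gain at
# initialization and rescale its weights toward unit gain.
import numpy as np

def block(x, W):
    """A toy fully-connected block: affine map followed by ReLU."""
    return np.maximum(W @ x, 0.0)

def estimate_jacobian_gain(W, dim, n_probes=1000, eps=1e-3):
    """Finite-difference estimate of E ||J v||^2 over random inputs x and
    random unit directions v, where J is the block's Jacobian at x."""
    gains = []
    for _ in range(n_probes):
        x = np.random.randn(dim)
        v = np.random.randn(dim)
        v /= np.linalg.norm(v)
        jv = (block(x + eps * v, W) - block(x, W)) / eps
        gains.append(np.sum(jv**2))
    return np.mean(gains)

dim, depth = 256, 20
Ws = [np.random.randn(dim, dim) / np.sqrt(dim) for _ in range(depth)]
for W in Ws:
    gain = estimate_jacobian_gain(W, dim)
    W *= 1.0 / np.sqrt(gain)   # for a ReLU block this recovers roughly sqrt(2) ("He") scaling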

Pre-training via Denoising for Molecular Property Prediction

TLDR
This paper describes a pre-training technique that utilizes large datasets of 3D molecular structures at equilibrium to learn meaningful representations for downstream tasks, and shows that the objective corresponds to learning a molecular force field – arising from approximating the physical state distribution with a mixture of Gaussians – directly from equilibrium structures.
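
The sketch below shows only the shape of such a denoising objective and is not the paper's implementation: equilibrium 3D coordinates are perturbed with Gaussian noise and the model regresses the added noise, which (under the mixture-of-Gaussians approximation mentioned above) is proportional to a force on each atom. The function names and the trivial stand-in model are assumptions.

# Hedged sketch of a coordinate-denoising pre-training objective.
import numpy as np

def denoising_example(coords, sigma=0.1):
    """coords: (n_atoms, 3) equilibrium positions. Returns the noisy
    coordinates and the regression target (the added noise)."""
    noise = sigma * np.random.randn(*coords.shape)
    return coords + noise, noise

def denoising_loss(model, coords, sigma=0.1):
    """Mean-squared error between the model's per-atom 3D prediction on
    the noisy structure and the noise that was added."""
    noisy, target = denoising_example(coords, sigma)
    pred = model(noisy)
    return np.mean((pred - target) ** 2)

# Toy usage: one 5-atom "molecule" and a stand-in model that predicts zeros.
coords = np.random.randn(5, 3)
print(denoising_loss(lambda x: np.zeros_like(x), coords))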

The Neural Covariance SDE: Shaped Infinite Depth-and-Width Networks at Initialization

TLDR
This work identifies the precise scaling of the activation function necessary to arrive at a non-trivial limit, and shows that the random covariance matrix is governed by a stochastic differential equation (SDE) which it calls the Neural Covariance SDE.

References

Showing 1-10 of 63 references

Going Deeper With Neural Networks Without Skip Connections

TLDR
This work proposes the training of very deep PlainNets by leveraging Leaky Rectified Linear Units (LReLUs), parameter constraints, and strategic parameter initialization, and reports the best known results on the ImageNet dataset using a PlainNet.

Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping

TLDR
Deep Kernel Shaping (DKS) enables SGD training of residual networks without normalization layers on Imagenet and CIFAR-10 classification tasks at speeds comparable to standard ResNetV2 and Wide-ResNet models, with only a small decrease in generalization performance.

Disentangling Trainability and Generalization in Deep Neural Networks

TLDR
This work identifies large regions of hyperparameter space for which networks can memorize the training set but completely fail to generalize, and finds that CNNs without global average pooling behave almost identically to FCNs, while CNNs with pooling have markedly different and often better generalization performance.

The Shattered Gradients Problem: If resnets are the answer, then what is the question?

TLDR
It is shown that the correlation between gradients in standard feedforward networks decays exponentially with depth, resulting in gradients that resemble white noise, whereas the gradients in architectures with skip connections are far more resistant to shattering, decaying only sublinearly.
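
A small numerical illustration of this shattering effect (an assumption-laden toy, not the paper's experiment): compute the gradient of a deep plain ReLU network's output with respect to its input at two nearby inputs and watch the similarity of those gradients fall as depth increases.

# Hedged sketch: input-gradients of a plain ReLU network decorrelate
# with depth for nearby inputs.
import numpy as np

def input_gradient(x, Ws):
    """Gradient of sum(h_L) w.r.t. the input x for a plain ReLU network
    with weight matrices Ws and no skip connections."""
    h, masks = x, []
    for W in Ws:
        pre = W @ h
        masks.append((pre > 0).astype(float))
        h = np.maximum(pre, 0.0)
    g = np.ones_like(h)                      # d sum(h_L) / d h_L
    for W, m in zip(reversed(Ws), reversed(masks)):
        g = W.T @ (g * m)                    # backprop through ReLU, then the affine map
    return g

dim = 256
for depth in (2, 10, 50):
    Ws = [np.sqrt(2.0 / dim) * np.random.randn(dim, dim) for _ in range(depth)]
    x = np.random.randn(dim)
    g1 = input_gradient(x, Ws)
    g2 = input_gradient(x + 0.3 * np.random.randn(dim), Ws)
    corr = g1 @ g2 / (np.linalg.norm(g1) * np.linalg.norm(g2))
    print(depth, corr)                       # similarity shrinks as depth grows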

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

TLDR
This work proposes the Parametric Rectified Linear Unit (PReLU), which generalizes the traditional rectified linear unit, and derives a robust initialization method that explicitly accounts for the rectifier nonlinearities.
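
A minimal sketch of the two ingredients named above; the layer sizes and the example slope value are illustrative assumptions, not the paper's code.

# Hedged sketch: the PReLU activation and the rectifier-aware ("He")
# weight initialization, std = sqrt(2 / ((1 + a^2) * fan_in)), which
# reduces to sqrt(2 / fan_in) for a plain ReLU (a = 0).
import numpy as np

def prelu(x, a):
    """Parametric ReLU: identity for x >= 0, learnable slope a for x < 0."""
    return np.where(x >= 0, x, a * x)

def he_init(fan_out, fan_in, a=0.0):
    """Gaussian init whose scale keeps pre-activation variance stable
    through rectifier layers."""
    std = np.sqrt(2.0 / ((1.0 + a**2) * fan_in))
    return std * np.random.randn(fan_out, fan_in)

# Toy usage: one layer with a PReLU whose slope starts at 0.25.
a = 0.25
W = he_init(512, 512, a=a)
h = prelu(W @ np.random.randn(512), a)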

Batch Normalization Biases Residual Blocks Towards the Identity Function in Deep Networks

TLDR
This work develops a simple initialization scheme that can train deep residual networks without normalization, and provides a detailed empirical study of residual networks, which clarifies that, although batch normalized networks can be trained with larger learning rates, this effect is only beneficial in specific compute regimes, and has minimal benefits when the batch size is small.

Characterizing signal propagation to close the performance gap in unnormalized ResNets

TLDR
A simple set of analysis tools for characterizing signal propagation on the forward pass is proposed, and the accompanying technique preserves the signal in networks with ReLU or Swish activation functions by ensuring that the per-channel activation means do not grow with depth.
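
As a rough illustration of this kind of forward-pass analysis (an assumption, not the paper's tooling), the sketch below records per-channel activation statistics of a plain ReLU network at initialization, the quantities one would plot to check whether channel means grow with depth.

# Hedged sketch: track average squared channel mean and average channel
# variance layer by layer at initialization ("signal propagation" stats).
import numpy as np

def signal_propagation(depth=30, dim=256, batch=512):
    x = np.random.randn(batch, dim)
    stats = []
    for _ in range(depth):
        W = np.sqrt(2.0 / dim) * np.random.randn(dim, dim)
        x = np.maximum(x @ W.T, 0.0)
        stats.append((np.mean(np.mean(x, axis=0) ** 2),   # avg squared channel mean
                      np.mean(np.var(x, axis=0))))        # avg channel variance
    return stats

for layer, (sq_mean, var) in enumerate(signal_propagation(), start=1):
    if layer % 10 == 0:
        print(layer, sq_mean, var)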

Self-Normalizing Neural Networks

TLDR
Self-normalizing neural networks (SNNs) are introduced to enable high-level abstract representations, and it is proved that activations close to zero mean and unit variance that are propagated through many network layers will converge towards zero mean and unit variance, even in the presence of noise and perturbations.
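
For reference, a short sketch of the SELU activation that drives this self-normalizing behavior, using its standard fixed constants; the depth-propagation check below is an illustrative toy, not the paper's experiments.

# Hedged sketch: the SELU activation plus a quick empirical check that
# activation statistics stay near zero mean and unit variance with depth.
import numpy as np

SELU_ALPHA = 1.6732632423543772
SELU_LAMBDA = 1.0507009873554805

def selu(x):
    return SELU_LAMBDA * np.where(x >= 0, x, SELU_ALPHA * (np.exp(x) - 1.0))

x = np.random.randn(1024, 256)
for _ in range(50):
    W = np.random.randn(256, 256) / np.sqrt(256)   # "LeCun normal"-style scaling
    x = selu(x @ W.T)
print(x.mean(), x.var())   # stays close to (0, 1) rather than drifting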

ReZero is All You Need: Fast Convergence at Large Depth

TLDR
This work shows that the simple architectural change of gating each residual connection with a single zero-initialized parameter satisfies initial dynamical isometry and outperforms more complex approaches; applied to language modeling, this approach can easily train 120-layer Transformers.
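
A minimal sketch of that gating, with a toy ReLU branch standing in for a Transformer sub-layer (the branch and sizes are assumptions, not the paper's setup):

# Hedged sketch: every residual branch is multiplied by a scalar that
# starts at zero, so each block is the identity map at initialization.
import numpy as np

class ReZeroBlock:
    def __init__(self, dim):
        self.W = np.random.randn(dim, dim) / np.sqrt(dim)
        self.alpha = 0.0                   # single zero-initialized gate, learned during training

    def __call__(self, x):
        branch = np.maximum(self.W @ x, 0.0)
        return x + self.alpha * branch     # exact identity at initialization

x = np.random.randn(64)
for blk in [ReZeroBlock(64) for _ in range(120)]:   # depth as in the summary
    x = blk(x)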

Deep Residual Learning for Image Recognition

TLDR
This work presents a residual learning framework to ease the training of networks that are substantially deeper than those used previously, and provides comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth.
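
A minimal sketch of the core idea, with a two-layer fully-connected branch standing in for the paper's convolutional blocks (an illustrative simplification):

# Hedged sketch: a residual block adds its learned transformation F(x)
# back onto the input, so the block only models a deviation from identity.
import numpy as np

def residual_block(x, W1, W2):
    out = np.maximum(W1 @ x, 0.0)     # first layer + ReLU
    out = W2 @ out                    # second layer
    return np.maximum(out + x, 0.0)   # add the shortcut connection, then ReLU

dim = 64
x = np.random.randn(dim)
for _ in range(50):                   # stacking many blocks stays well-behaved
    W1 = np.random.randn(dim, dim) / np.sqrt(dim)
    W2 = np.random.randn(dim, dim) / np.sqrt(dim)
    x = residual_block(x, W1, W2)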
...