Corpus ID: 238354214

Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping

@article{Martens2021RapidTO,
  title={Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping},
  author={James Martens and Andy Ballard and Guillaume Desjardins and Grzegorz Swirszcz and Valentin Dalibard and Jascha Narain Sohl-Dickstein and Samuel S. Schoenholz},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.01765}
}
Using an extended and formalized version of the Q/C map analysis of Poole et al. (2016), along with Neural Tangent Kernel theory, we identify the main pathologies present in deep networks that prevent them from training fast and generalizing to unseen data, and show how these can be avoided by carefully controlling the "shape" of the network's initialization-time kernel function. We then develop a method called Deep Kernel Shaping (DKS), which accomplishes this using a combination of precise…
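For context, a minimal sketch of the Q/C map recursions from Poole et al. (2016) that the abstract refers to, estimated by Monte Carlo for a tanh network at initialization. The JAX code, the constants sigma_w and sigma_b, and the depth are illustrative choices, not taken from the DKS paper or its code release.

```python
# Minimal sketch (not the DKS method itself) of the Q/C map recursions of
# Poole et al. (2016), estimated by Monte Carlo for a fully connected tanh
# network at initialization. sigma_w, sigma_b and the depth are illustrative.
import jax
import jax.numpy as jnp

z1, z2 = jax.random.normal(jax.random.PRNGKey(0), (2, 100_000))

def q_map(q, sigma_w=1.0, sigma_b=0.3, phi=jnp.tanh):
    """Length (Q) map: pre-activation variance after one more layer."""
    return sigma_w**2 * jnp.mean(phi(jnp.sqrt(q) * z1) ** 2) + sigma_b**2

def c_map(c, q, sigma_w=1.0, sigma_b=0.3, phi=jnp.tanh):
    """Correlation (C) map for two inputs that both have variance q."""
    u1 = jnp.sqrt(q) * z1
    u2 = jnp.sqrt(q) * (c * z1 + jnp.sqrt(1.0 - c**2) * z2)
    cov = sigma_w**2 * jnp.mean(phi(u1) * phi(u2)) + sigma_b**2
    # clip guards against Monte Carlo noise pushing |c| past 1
    return jnp.clip(cov / q_map(q, sigma_w, sigma_b, phi), -1.0, 1.0)

q, c = 1.0, 0.5
for _ in range(50):
    c = c_map(c, q)   # in this ordered regime correlations converge toward 1,
    q = q_map(q)      # i.e. a degenerate kernel, one pathology DKS controls for
print(f"after 50 layers: q = {q:.3f}, c = {c:.4f}")
```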

Deep Learning without Shortcuts: Shaping the Kernel with Tailored Rectifiers

The method, which introduces negligible extra computational cost, achieves validation accuracies with deep vanilla networks that are competitive with ResNets (of the same width/depth), and significantly higher than those obtained with the Edge of Chaos (EOC) method.

Improving the Trainability of Deep Neural Networks through Layerwise Batch-Entropy Regularization

It is shown empirically and proved theoretically that a positive batch entropy is required for gradient-descent-based training to successfully optimize a given loss function, and that a "vanilla" fully connected network and convolutional neural network can be trained with no skip connections, batch normalization, dropout, or any other architectural tweak.
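As a rough illustration of the idea only: the paper's exact batch-entropy estimator is not reproduced here; the sketch below scores each layer with a Gaussian differential-entropy proxy of its activations across the batch and penalizes layers whose score falls below a (hypothetical) target.

```python
# Rough sketch of a layerwise batch-entropy regularizer. The entropy proxy
# assumes roughly Gaussian activations (0.5 * log(2*pi*e*var) per unit);
# the cited paper's exact estimator, target and weighting may differ.
import jax.numpy as jnp

def batch_entropy(acts, eps=1e-6):
    """acts: (batch, units) activations of one layer; returns a scalar score."""
    var = jnp.var(acts, axis=0) + eps
    return jnp.mean(0.5 * jnp.log(2 * jnp.pi * jnp.e * var))

def lbe_penalty(per_layer_acts, target=0.5):
    """Penalize layers whose batch-entropy score drops below a target value."""
    scores = jnp.stack([batch_entropy(a) for a in per_layer_acts])
    return jnp.mean(jnp.maximum(target - scores, 0.0))

# total_loss = task_loss + lbe_weight * lbe_penalty(collected_activations)
```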

Critical initialization of wide and deep neural networks through partial Jacobians: general theory and applications to LayerNorm

A new practical way to diagnose criticality of deep fully connected neural networks with LayerNorm and/or residual connections is described, and a simple and cheap numerical test is derived that allows one to select an optimal initialization for a broad class of deep neural networks.
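A rough sketch of what such a test can look like (not the paper's exact diagnostic): measure how the width-normalized squared Frobenius norm of a partial Jacobian behaves at a random initialization of a plain tanh MLP. The widths, depth and gain below are illustrative.

```python
# Sketch of a cheap criticality diagnostic based on partial Jacobians:
# track || d h^L / d h^{l0} ||_F^2 / width at a random initialization.
import jax
import jax.numpy as jnp

width, depth, sigma_w = 256, 20, 1.0
key = jax.random.PRNGKey(0)
Ws = [jax.random.normal(k, (width, width)) * sigma_w / jnp.sqrt(width)
      for k in jax.random.split(key, depth)]

def forward_from(h, layers):
    for W in layers:
        h = jnp.tanh(W @ h)
    return h

h0 = jax.random.normal(jax.random.PRNGKey(1), (width,))
for l0 in (0, 5, 10, 15):
    h_l0 = forward_from(h0, Ws[:l0])                     # state entering layer l0
    J = jax.jacrev(lambda h: forward_from(h, Ws[l0:]))(h_l0)
    print(f"l0={l0:2d}  ||J||_F^2 / width = {jnp.sum(J**2) / width:.4f}")
# At a critical initialization this quantity stays O(1) regardless of how many
# layers separate l0 from the output; otherwise it decays or grows geometrically.
```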

Understanding the Covariance Structure of Convolutional Filters

This work observes that learned filters have highly-structured covariance matrices, and finds that covariances calculated from small networks may be used to effectively initialize a variety of larger networks of different depths, widths, patch sizes, and kernel sizes, indicating a degree of model-independence to the covariance structure.
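A minimal sketch of the general recipe described above, with a synthetic stand-in for the donor network's learned filters: estimate the covariance of flattened 3x3 filters and sample fresh filters for a new, larger layer from a zero-mean Gaussian with that covariance. All shapes and the jitter term are illustrative.

```python
# Sketch of covariance-based filter initialization using a synthetic "donor"
# filter bank in place of filters taken from a trained network.
import jax
import jax.numpy as jnp

donor = jax.random.normal(jax.random.PRNGKey(0), (1024, 9))   # 1024 flattened 3x3 filters
cov = jnp.cov(donor, rowvar=False)                            # (9, 9) filter covariance
cov = cov + 1e-5 * jnp.eye(9)                                 # jitter for a stable Cholesky

n_out, n_in = 64, 32
flat = jax.random.multivariate_normal(
    jax.random.PRNGKey(1), jnp.zeros(9), cov, shape=(n_out, n_in))
new_filters = flat.reshape(n_out, n_in, 3, 3)                 # init for a larger conv layer
print(new_filters.shape)
```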

AutoInit: Automatic Initialization via Jacobian Tuning

A new and cheap algorithm is introduced that automatically produces a good initialization for general feed-forward DNNs by using the Jacobian between adjacent network blocks to tune the network's hyperparameters to criticality.
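In the same spirit as the partial-Jacobian test sketched earlier, here is an illustrative (not the paper's) tuning loop: each block's weights are rescaled until the mean squared singular value of its input-output Jacobian is roughly 1 at a random input, then the input is propagated to tune the next block.

```python
# Illustrative Jacobian-tuning sketch: rescale each block so its Jacobian has
# mean squared singular value ~1 at init. Criterion and iteration count are
# assumptions, not the cited algorithm's exact recipe.
import jax
import jax.numpy as jnp

width, depth = 128, 10
key = jax.random.PRNGKey(0)
Ws = [jax.random.normal(k, (width, width)) / jnp.sqrt(width)
      for k in jax.random.split(key, depth)]

h = jax.random.normal(jax.random.PRNGKey(1), (width,))
tuned = []
for W in Ws:
    for _ in range(5):                                   # a few fixed-point steps per block
        J = jax.jacrev(lambda x, W=W: jnp.tanh(W @ x))(h)
        W = W / jnp.sqrt(jnp.sum(J**2) / width)          # push mean sq. singular value to 1
    tuned.append(W)
    h = jnp.tanh(W @ h)                                  # propagate to tune the next block
```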

Exploring the Gap between Collapsed & Whitened Features in Self-Supervised Learning

This work identifies power-law behaviour in eigenvalue decay, parameterised by an exponent β ≥ 0, as a spectrum that bridges the collapsed and whitened feature extremes, and motivates a novel method, Post-hoc Manipulation of the Principal Axes & Trace (PostMan-Pat), which delivers improved label efficiency and transferability across a range of SSL methods and encoder architectures.
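As a simple illustration of the quantity being measured (not the cited method itself), one can estimate the decay exponent β by a log-log least-squares fit of the feature-covariance eigenvalue spectrum; the features below are random stand-ins for a real SSL encoder's embeddings.

```python
# Sketch: estimate the power-law decay exponent beta of a feature covariance
# spectrum from a log(eigenvalue) vs log(rank) least-squares fit.
import jax
import jax.numpy as jnp

feats = jax.random.normal(jax.random.PRNGKey(0), (4096, 512))   # (samples, embedding dim)
feats = feats - feats.mean(axis=0)
cov = feats.T @ feats / feats.shape[0]
eigs = jnp.sort(jnp.linalg.eigvalsh(cov))[::-1]                 # descending eigenvalues

ranks = jnp.arange(1, eigs.shape[0] + 1)
slope, _ = jnp.polyfit(jnp.log(ranks), jnp.log(eigs), 1)
beta = -slope            # beta near 0: whitened features; large beta: collapsed features
print(f"estimated decay exponent beta = {beta:.2f}")
```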

On the Optimization Landscape of Neural Collapse under MSE Loss: Global Optimality with Unconstrained Features

Under a simplified unconstrained feature model, this work provides the first global landscape analysis for the vanilla nonconvex MSE loss and shows that the (only!) global minimizers are neural collapse solutions, while all other critical points are strict saddles whose Hessians exhibit negative curvature directions.
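For reference, the neural-collapse geometry in question is a simplex equiangular tight frame (ETF): K unit-norm class means whose pairwise inner products all equal -1/(K-1). A quick construction and check:

```python
# Construct a K-class simplex ETF and verify its Gram matrix: ones on the
# diagonal, -1/(K-1) off the diagonal.
import jax.numpy as jnp

K = 10                                                            # number of classes
M = jnp.sqrt(K / (K - 1)) * (jnp.eye(K) - jnp.ones((K, K)) / K)   # rows are class means
gram = M @ M.T
expected = jnp.eye(K) * (1 + 1.0 / (K - 1)) - 1.0 / (K - 1)
print(jnp.allclose(gram, expected, atol=1e-5))                    # True
```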

Neural Tangent Kernel: A Survey

The present survey covers key results on kernel convergence as width goes to infinity, finite-width corrections, applications, and a discussion of the limitations of the NTK approach.
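As a concrete reminder of the object surveyed, a minimal sketch of the empirical (finite-width) NTK of a small MLP, Θ(x1, x2) = ∇θ f(x1) · ∇θ f(x2); the architecture and widths are arbitrary choices.

```python
# Sketch: one entry of the empirical NTK of a small tanh MLP, computed as the
# dot product of flattened parameter gradients at two inputs.
import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def init(key, sizes=(8, 64, 64, 1)):
    keys = jax.random.split(key, len(sizes) - 1)
    return [(jax.random.normal(k, (m, n)) / jnp.sqrt(n), jnp.zeros(m))
            for k, n, m in zip(keys, sizes[:-1], sizes[1:])]

def net(params, x):
    for W, b in params[:-1]:
        x = jnp.tanh(W @ x + b)
    W, b = params[-1]
    return (W @ x + b)[0]                        # scalar network output

params = init(jax.random.PRNGKey(0))

def grad_flat(x):
    return ravel_pytree(jax.grad(lambda p: net(p, x))(params))[0]

x1, x2 = jax.random.normal(jax.random.PRNGKey(1), (2, 8))
print(grad_flat(x1) @ grad_flat(x2))             # empirical NTK entry Theta(x1, x2)
```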

Deep equilibrium networks are sensitive to initialization statistics

It is shown that DEQs are sensitive to the higher-order statistics of the matrix families from which they are initialized, which gives a practical prescription for initializations that allow training with a broader range of initial weight scales.
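To make the setting concrete (this is only a toy setup, not the paper's analysis): a deep equilibrium layer solves z* = f(z*, x), and the weight matrix can be drawn from families with identical second moments but different higher-order statistics, e.g. Gaussian versus scaled Rademacher entries.

```python
# Toy deep equilibrium layer z* = tanh(W z* + x), solved by naive fixed-point
# iteration, for two weight families with the same variance but different
# higher-order statistics. Width, gain and iteration count are illustrative.
import jax
import jax.numpy as jnp

width, gain = 512, 0.9
k1, k2, kx = jax.random.split(jax.random.PRNGKey(0), 3)
x = jax.random.normal(kx, (width,))

def solve(W, x, iters=200):
    z = jnp.zeros_like(x)
    for _ in range(iters):
        z = jnp.tanh(W @ z + x)
    return z

W_gauss = gain * jax.random.normal(k1, (width, width)) / jnp.sqrt(width)
W_sign = gain * jnp.sign(jax.random.normal(k2, (width, width))) / jnp.sqrt(width)

for name, W in [("gaussian", W_gauss), ("rademacher", W_sign)]:
    z = solve(W, x)
    resid = jnp.linalg.norm(jnp.tanh(W @ z + x) - z)
    print(f"{name}: |z*| = {jnp.linalg.norm(z):.3f}, residual = {resid:.2e}")
```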

Nonlinear Initialization Methods for Low-Rank Neural Networks

A practical algorithm is provided to solve the ReLU low-rank approximation problem for parameter rank r that is no more expensive than the existing spectral initialization approach, and it is proved that there is a gap between these two approaches for ReLU networks.
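For reference, the spectral-initialization baseline mentioned above amounts to a truncated SVD of the layer weights; the paper's ReLU-aware nonlinear initialization is a different procedure and is not reproduced here. The weight matrix below is a random stand-in for a pretrained layer.

```python
# Sketch of the spectral-initialization baseline: factor a layer weight W into
# a rank-r product U_r @ V_r via truncated SVD.
import jax
import jax.numpy as jnp

W = jax.random.normal(jax.random.PRNGKey(0), (256, 512)) / jnp.sqrt(512)
r = 32                                                     # target parameter rank

U, s, Vt = jnp.linalg.svd(W, full_matrices=False)
U_r = U[:, :r] * jnp.sqrt(s[:r])                           # (256, r) factor
V_r = jnp.sqrt(s[:r])[:, None] * Vt[:r]                    # (r, 512) factor

err = jnp.linalg.norm(W - U_r @ V_r) / jnp.linalg.norm(W)
print(f"rank-{r} relative reconstruction error: {err:.3f}")
```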