Corpus ID: 227229101

Feature Learning in Infinite-Width Neural Networks

Greg Yang and Edward J. Hu
As its width tends to infinity, a deep neural network's behavior under gradient descent can become simplified and predictable (e.g. given by the Neural Tangent Kernel (NTK)), if it is parametrized appropriately (e.g. the NTK parametrization). However, we show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial for pretraining and transfer learning such as with BERT. We propose simple modifications to the… 
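The contrast between parametrizations in the abstract can be made concrete with a small sketch. The following numpy illustration is not from the paper (the width `n` and seed are arbitrary): in the standard parametrization the 1/fan_in variance is baked into the initialization, while in the NTK parametrization the weights have unit variance and a 1/sqrt(fan_in) factor appears in the forward pass. At initialization both produce per-coordinate outputs of the same O(1) scale, yet they induce different gradient dynamics as the width grows, which is why their infinite-width limits differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4096                       # hidden width (arbitrary for illustration)
x = rng.normal(size=n)         # an O(1)-coordinate input vector

# Standard parametrization: variance 1/fan_in baked into the initialization.
W_std = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))
y_std = W_std @ x

# NTK parametrization: unit-variance init, 1/sqrt(fan_in) factor in the forward pass.
W_ntk = rng.normal(scale=1.0, size=(n, n))
y_ntk = (W_ntk @ x) / np.sqrt(n)

# Both outputs have O(1) coordinates at initialization; the difference between
# the parametrizations only shows up once gradient descent updates the weights.
std_standard = float(np.std(y_std))
std_ntk = float(np.std(y_ntk))
```

The two forward passes are identical in distribution at initialization; the paper's point is that neither admits an infinite-width limit under gradient descent in which the hidden representations move, motivating the modified parametrization it proposes.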


Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks
Comparisons of the self-consistent solution to various approximation schemes, including the static NTK approximation, the gradient independence assumption, and leading-order perturbation theory, are provided, showing that each of these approximations can break down in regimes where the self-consistent solution still provides an accurate description.
Deep kernel machines: exact inference with representation learning in infinite Bayesian neural networks
This work gives a proof of unimodality for linear kernels, and a number of experiments in the nonlinear case in which all deep kernel machine initializations the authors tried converged to the same solution.
Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
This work shows that, in the recently discovered Maximal Update Parametrization (µP), many optimal HPs remain stable even as model size changes, and introduces a new HP tuning paradigm, µTransfer: parametrize the target model in µP, tune the HPs indirectly on a smaller model, and zero-shot transfer them to the full-sized model, i.e., without directly tuning the latter at all.
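As a rough illustration of the µTransfer workflow that summary describes, here is a hypothetical Python sketch; `make_model`, `train`, and the grid values are placeholders introduced for illustration, not the paper's API (the authors' actual recipe ships as the `mup` package). The key property it relies on is that, under µP, optimal hyperparameters such as the learning rate are approximately width-independent.

```python
def tune_then_transfer(make_model, train, proxy_width=256, target_width=8192,
                       lr_grid=(1e-3, 3e-3, 1e-2)):
    """Tune a hyperparameter on a small proxy model, then reuse it at full size.

    `make_model(width)` builds a µP-parametrized model; `train(model, lr)`
    trains it and returns a loss. Both are hypothetical placeholders.
    """
    # 1) Grid-search the learning rate on a narrow proxy model.
    proxy_losses = {lr: train(make_model(proxy_width), lr) for lr in lr_grid}
    best_lr = min(proxy_losses, key=proxy_losses.get)  # lowest loss wins
    # 2) Under µP the optimum is approximately width-independent,
    #    so reuse it zero-shot on the full-sized model.
    return train(make_model(target_width), best_lr)
```

The design point is that the expensive model is trained exactly once, with hyperparameters found entirely on the cheap proxy.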
Synergy and Symmetry in Deep Learning: Interactions between the Data, Model, and Inference Algorithm
This paper analyzes the triplet (D, M, I) as an integrated system and identifies important synergies that help mitigate the curse of dimensionality.
Laziness, Barren Plateau, and Noise in Machine Learning
This work reformulates the quantum barren plateau statement into a precise statement, injecting new hope into near-term variational quantum algorithms, and provides theoretical connections to classical machine learning.
The Neural Covariance SDE: Shaped Infinite Depth-and-Width Networks at Initialization
This work identifies the precise scaling of the activation function necessary to arrive at a non-trivial limit, and shows that the random covariance matrix is governed by a stochastic differential equation (SDE) which it calls the Neural Covariance SDE.
Feature Learning in L2-regularized DNNs: Attraction/Repulsion and Sparsity
A sparsity result for homogeneous DNNs is proved: any local minimum of the L2-regularized loss can be achieved with at most N(N+1) neurons in each hidden layer (where N is the size of the training set).
Quadratic models for understanding neural network dynamics
It is shown that the extra quadratic term in NQMs allows for catapult convergence: the loss increases at an early stage and then converges afterwards. The top eigenvalues of the tangent kernel typically decrease after the catapult phase, while they are nearly constant when training with sub-critical learning rates, where the loss converges monotonically.
High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation
It is demonstrated that even one gradient step can lead to a considerable advantage over random features, and the role of learning rate scaling in the initial phase of training is highlighted.
A duality connecting neural network and cosmological dynamics
We demonstrate that the dynamics of neural networks trained with gradient descent and the dynamics of scalar fields in a flat, vacuum-energy-dominated Universe are profoundly related in structure.


Tensor Programs II: Neural Tangent Kernel for Any Architecture
It is proved that a randomly initialized neural network of *any architecture* has its Neural Tangent Kernel (NTK) converge to a deterministic limit as the network widths tend to infinity, and a commonly satisfied condition, called the *Simple GIA Check*, is given under which the NTK limit calculation based on the gradient independence assumption (GIA) is correct.
Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks
We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems.
Distributed Representations of Words and Phrases and their Compositionality
This paper presents a simple method for finding phrases in text, and shows that learning good vector representations for millions of phrases is possible and describes a simple alternative to the hierarchical softmax called negative sampling.
Efficient Estimation of Word Representations in Vector Space
Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.
Tensor Programs III: Neural Matrix Laws
The Free Independence Principle (FIP) is shown by proving a Master Theorem for any Tensor Program, as introduced in Yang [50, 51], generalizing the Master Theorems proved in those works.
Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation
This work opens a way toward design of even stronger Gaussian Processes, initialization schemes to avoid gradient explosion/vanishing, and deeper understanding of SGD dynamics in modern architectures.
Modeling from Features: a Mean-field Framework for Over-parameterized Deep Neural Networks (arXiv:2007.01452, 2020)
This analysis leads to the first global convergence proof for over-parameterized neural network training with more than 3 layers in the mean-field regime, and to a simpler representation of DNNs for which the training objective can be reformulated as a convex optimization problem via suitable reparameterization.
Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers
It is shown that large models are more robust to compression techniques such as quantization and pruning than small models, and one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.