Corpus ID: 227229101

Feature Learning in Infinite-Width Neural Networks

Greg Yang and Edward J. Hu
As its width tends to infinity, a deep neural network's behavior under gradient descent can become simplified and predictable (e.g. given by the Neural Tangent Kernel (NTK)), if it is parametrized appropriately (e.g. the NTK parametrization). However, we show that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial for pretraining and transfer learning such as with BERT. We propose simple modifications to the… 
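The contrast drawn in the abstract, features that freeze under the NTK parametrization but move under a width-corrected one, can be sketched numerically. The toy below is illustrative rather than the paper's full Maximal Update Parametrization (which also prescribes scalings for other layers and initializations): it measures how much the hidden features of a two-layer tanh network move after one SGD step, comparing a 1/sqrt(n) NTK-style output scale against a 1/n muP-style scale with a width-rescaled hidden learning rate. The function name and setup are assumptions for illustration.

```python
import numpy as np

def rms_feature_change(n, param, lr=1.0, d=8, seed=0):
    """Take one SGD step on a two-layer net f(x) = c * v @ tanh(U @ x)
    with loss L = -f(x) (so dL/df = -1), and return the RMS change of
    the hidden features (U @ x)_i caused by that step."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(d) / np.sqrt(d)    # input; drawn first, identical for all n
    U = rng.standard_normal((n, d))            # hidden weights, O(1) entries
    v = rng.standard_normal(n)                 # output weights, O(1) entries
    if param == "ntk":
        c, lr_U = 1.0 / np.sqrt(n), lr         # NTK-style: 1/sqrt(n) output scale
    else:
        c, lr_U = 1.0 / n, lr * n              # muP-style: 1/n scale, lr scaled by n
    h = U @ x                                  # hidden features before the step
    grad_U = np.outer(-c * v * (1.0 - np.tanh(h) ** 2), x)   # dL/dU
    dh = (U - lr_U * grad_U) @ x - h           # feature movement after the step
    return float(np.sqrt(np.mean(dh ** 2)))
```

As width n grows, the NTK-style feature change shrinks roughly like 1/sqrt(n), while the muP-style change stays of order one, which is the feature-learning distinction the abstract draws.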

Figures and Tables from this paper

Feature Kernel Distillation

It is proved that KD using only pairwise feature kernel comparisons can improve NN test accuracy in such settings, with both single & ensemble teacher models, whereas standard training without KD fails to generalise.

Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks

Comparisons of the self-consistent solution to various approximation schemes, including the static NTK approximation, the gradient independence assumption, and leading-order perturbation theory, are provided, showing that each of these approximations can break down in regimes where general self-consistent solutions still provide an accurate description.

Deep kernel machines: exact inference with representation learning in infinite Bayesian neural networks

This work gives a proof of unimodality for linear kernels, and a number of experiments in the nonlinear case in which all deep kernel machine initializations the authors tried converged to the same solution.

Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer

This work shows that, in the recently discovered Maximal Update Parametrization (µP), many optimal HPs remain stable even as model size changes, and introduces a new HP tuning paradigm, µTransfer: parametrize the target model in µP, tune the HPs indirectly on a smaller model, and zero-shot transfer them to the full-sized model, i.e., without directly tuning the latter at all.

Adaptive Budget Allocation for Parameter-Efficient Fine-Tuning

Computer Science • 2023
The proposed AdaLoRA adaptively allocates the parameter budget among weight matrices according to their importance scores, allowing it to effectively prune the singular values of unimportant updates, which essentially reduces their parameter budget while circumventing intensive exact SVD computations.

Provable Particle-based Primal-Dual Algorithm for Mixed Nash Equilibrium

This work proposes a Particle-based Primal-Dual Algorithm (PPDA) for a weakly entropy-regularized min-max optimization procedure over probability distributions, which employs the stochastic movements of particles to represent the updates of random strategies for the mixed Nash equilibrium.

Learning time-scales in two-layers neural networks

This paper studies the gradient-flow dynamics of a wide two-layer neural network in high dimension, when data are distributed according to a single-index model (i.e., the target function depends on a one-dimensional projection of the covariates).

How to prepare your task head for finetuning

A significant trend in the effect of changes in this initial energy on the resulting features after fine-tuning is identified and analytically proved in an overparameterized linear setting, and its applicability to different experimental settings is verified.

Alternating Updates for Efficient Transformers

This work introduces Alternating Updates (AltUp), a simple-to-implement method to increase a model's capacity without the computational burden, and presents extensions of AltUp to the sequence dimension, and demonstrates how it can be synergistically combined with existing approaches to obtain efficient models with even higher capacity.

Grokking modular arithmetic

A simple neural network that can learn modular arithmetic tasks and exhibits a sudden jump in generalization known as “grokking” is presented, together with evidence that grokking modular arithmetic corresponds to learning feature maps whose structure is determined by the task.



Tensor Programs II: Neural Tangent Kernel for Any Architecture

It is proved that a randomly initialized neural network of *any architecture* has its Neural Tangent Kernel (NTK) converge to a deterministic limit as the network widths tend to infinity, and a commonly satisfied condition, called the *Simple GIA Check*, is identified under which the NTK limit calculation based on the gradient independence assumption (GIA) is correct.
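The quantity whose convergence is asserted here can be checked by hand for a small architecture: the finite-width NTK is the parameter-space Jacobian inner product, and for a two-layer network it has a closed form. A minimal sketch under the NTK parametrization (names and setup are illustrative; a general architecture would require automatic differentiation):

```python
import numpy as np

def empirical_ntk(x1, x2, n, seed):
    """Jacobian inner product grad_theta f(x1) . grad_theta f(x2) for
    f(x) = (1/sqrt(n)) * v @ tanh(U @ x / sqrt(d)), in closed form."""
    d = x1.shape[0]
    rng = np.random.default_rng(seed)
    U = rng.standard_normal((n, d))
    v = rng.standard_normal(n)
    h1, h2 = U @ x1 / np.sqrt(d), U @ x2 / np.sqrt(d)
    # gradient w.r.t. v contributes (1/n) * tanh(h1) . tanh(h2)
    term_v = np.tanh(h1) @ np.tanh(h2) / n
    # gradient w.r.t. U contributes (x1.x2/d) * (1/n) * sum_i v_i^2 phi'(h1_i) phi'(h2_i)
    phi1, phi2 = 1.0 - np.tanh(h1) ** 2, 1.0 - np.tanh(h2) ** 2
    term_U = (x1 @ x2 / d) * np.sum(v ** 2 * phi1 * phi2) / n
    return term_v + term_U
```

Across random seeds, the value fluctuates less and less as n grows, consistent with convergence to a deterministic limit.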

Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks

We propose an algorithm for meta-learning that is model-agnostic, in the sense that it is compatible with any model trained with gradient descent and applicable to a variety of different learning problems, including classification, regression, and reinforcement learning.
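The "compatible with any model trained with gradient descent" idea can be made concrete on a toy problem where the meta-gradient has an exact closed form: tasks are scalar quadratics L_i(theta) = (theta - c_i)^2, the inner loop takes one gradient step per task, and the outer loop differentiates the post-adaptation loss with respect to the shared initialization. A hypothetical minimal sketch, not the paper's implementation:

```python
def maml_quadratic(centers, inner_lr=0.1, meta_lr=0.25, steps=200):
    """MAML on tasks L_i(theta) = (theta - c_i)^2.
    Inner step: theta_i' = theta - inner_lr * 2 * (theta - c_i).
    Meta-objective: mean_i L_i(theta_i'); its exact gradient in theta is
    (1 - 2*inner_lr)^2 * 2 * mean_i (theta - c_i). The second-order term
    is exact here because the inner loss is quadratic."""
    theta = 0.0
    shrink = (1.0 - 2.0 * inner_lr) ** 2      # from differentiating through adaptation
    for _ in range(steps):
        meta_grad = shrink * 2.0 * sum(theta - c for c in centers)
        theta -= meta_lr * meta_grad / len(centers)
    return theta
```

The learned initialization converges to the mean of the task optima, which minimizes the post-adaptation loss in this toy setting.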

Distributed Representations of Words and Phrases and their Compositionality

This paper presents a simple method for finding phrases in text, shows that learning good vector representations for millions of phrases is possible, and describes a simple alternative to the hierarchical softmax called negative sampling.
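Negative sampling replaces the full softmax normalization with a few binary logistic terms: one for the observed (center, context) pair and one for each sampled negative. A minimal sketch of a single update (toy vectors; the actual word2vec implementation maintains separate input/output embedding tables and draws negatives from a smoothed unigram distribution):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgns_step(center, context, negatives, lr=0.5):
    """One SGD step on the skip-gram negative-sampling objective
    -log sigmoid(context . center) - sum_neg log sigmoid(-neg . center).
    Returns the updated output-side vectors."""
    # positive pair: gradient of -log sigmoid(context . center) is (sigmoid - 1) * center
    context = context - lr * (sigmoid(context @ center) - 1.0) * center
    # negatives: gradient of -log sigmoid(-neg . center) is sigmoid(neg . center) * center
    negatives = [neg - lr * sigmoid(neg @ center) * center for neg in negatives]
    return context, negatives
```

Repeated updates raise the score of the true context against the center word and lower the scores of the sampled negatives.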

Efficient Estimation of Word Representations in Vector Space

Two novel model architectures for computing continuous vector representations of words from very large data sets are proposed and it is shown that these vectors provide state-of-the-art performance on the authors' test set for measuring syntactic and semantic word similarities.

Tensor Programs III: Neural Matrix Laws

The Free Independence Principle (FIP) is shown by proving a Master Theorem for any Tensor Program, as introduced in Yang [50, 51], generalizing the Master Theorems proved in those works.

Scaling Limits of Wide Neural Networks with Weight Sharing: Gaussian Process Behavior, Gradient Independence, and Neural Tangent Kernel Derivation

This work opens a way toward the design of even stronger Gaussian processes, initialization schemes to avoid gradient explosion/vanishing, and a deeper understanding of SGD dynamics in modern architectures.

Modeling from Features: a Mean-field Framework for Over-parameterized Deep Neural Networks

This analysis leads to the first global convergence proof for over-parameterized neural network training with more than 3 layers in the mean-field regime, and to a simpler representation of DNNs, for which the training objective can be reformulated as a convex optimization problem via suitable re-parameterization.

Train Large, Then Compress: Rethinking Model Size for Efficient Training and Inference of Transformers

It is shown that large models are more robust to compression techniques such as quantization and pruning than small models, and one can get the best of both worlds: heavily compressed, large models achieve higher accuracy than lightly compressed, small models.

A Rigorous Framework for the Mean Field Limit of Multilayer Neural Networks

A mathematically rigorous framework for multilayer neural networks in the mean-field regime is developed, based on the new idea of a non-evolving probability space that allows embedding neural networks of arbitrary widths, and a global convergence guarantee is proved for two-layer and three-layer networks.