# On the linearity of large non-linear models: when and why the tangent kernel is constant

@article{Liu2020OnTL, title={On the linearity of large non-linear models: when and why the tangent kernel is constant}, author={Chaoyue Liu and Libin Zhu and Mikhail Belkin}, journal={ArXiv}, year={2020}, volume={abs/2010.01092} }

The goal of this work is to shed light on the remarkable phenomenon of transition to linearity of certain neural networks as their width approaches infinity. We show that the transition to linearity of the model and, equivalently, constancy of the (neural) tangent kernel (NTK) result from the scaling properties of the norm of the Hessian matrix of the network as a function of the network width. We present a general framework for understanding the constancy of the tangent kernel via Hessian…
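The width dependence described in the abstract can be probed empirically. The sketch below is an illustration, not the authors' code: it assumes a 1-hidden-layer ReLU network in the standard NTK parameterization, with widths and a perturbation size chosen for demonstration. It measures how much the tangent kernel moves under a fixed-norm parameter displacement at two widths; the drift shrinks as the width grows.

```python
import numpy as np

def ntk(x1, x2, W, v):
    """Tangent kernel of f(x) = v . relu(W x) / sqrt(m) at parameters (W, v)."""
    m = len(v)
    def grads(x):
        pre = W @ x
        act = np.maximum(pre, 0.0)
        gv = act / np.sqrt(m)                               # df/dv
        gW = (v * (pre > 0))[:, None] * x[None, :] / np.sqrt(m)  # df/dW
        return np.concatenate([gW.ravel(), gv])
    return grads(x1) @ grads(x2)

def ntk_drift(m, rng, d=5, delta_norm=1.0):
    """Relative change of k(x, x) under a random parameter step of fixed norm."""
    x = rng.standard_normal(d)
    W = rng.standard_normal((m, d))
    v = rng.standard_normal(m)
    k0 = ntk(x, x, W, v)
    dW = rng.standard_normal((m, d))
    dv = rng.standard_normal(m)
    s = delta_norm / np.sqrt((dW ** 2).sum() + (dv ** 2).sum())
    k1 = ntk(x, x, W + s * dW, v + s * dv)
    return abs(k1 - k0) / k0

rng = np.random.default_rng(0)
drift_narrow = np.mean([ntk_drift(50, rng) for _ in range(20)])
drift_wide = np.mean([ntk_drift(5000, rng) for _ in range(20)])
print(drift_narrow, drift_wide)  # kernel drift shrinks as width grows
```

Consistent with the Hessian-norm scaling argument, the kernel change under an O(1) parameter displacement decays roughly like the inverse square root of the width.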

## 39 Citations

A Neural Tangent Kernel Perspective of GANs

- Computer Science, ArXiv
- 2021

A novel theoretical framework for analyzing Generative Adversarial Networks (GANs) is proposed, leveraging the theory of infinite-width neural networks: the discriminator is characterized via its Neural Tangent Kernel for a wide range of losses, and general differentiability properties of the network are established.

Kernel Methods and Multi-layer Perceptrons Learn Linear Models in High Dimensions

- Computer Science
- 2022

This work shows that for a large class of kernels, including the neural tangent kernel of fully connected networks, kernel methods can only perform as well as linear models in this regime, and suggests that more complex models for the data other than independent features are needed for high-dimensional analysis.

Lecture 5: NTK Origin and Derivation

- Computer Science
- 2022

Conditions under which training the last layer of an infinitely wide, 1-hidden layer neural network is equivalent to solving kernel regression with the Neural Network Gaussian Process (NNGP) are established.

On the Equivalence between Neural Network and Support Vector Machine

- Computer Science, NeurIPS
- 2021

The theory enables three practical applications: (i) a non-vacuous generalization bound for the NN via the corresponding kernel machine (KM); (ii) a nontrivial robustness certificate for the infinite-width NN (where existing robustness verification methods would provide vacuous bounds); and (iii) infinite-width NNs that are intrinsically more robust than those obtained from previous kernel regression.

Transition to Linearity of General Neural Networks with Directed Acyclic Graph Architecture

- Computer Science
- 2022

In this paper we show that feedforward neural networks corresponding to arbitrary directed acyclic graphs undergo transition to linearity as their "width" approaches infinity. The width of these…

Transition to Linearity of Wide Neural Networks is an Emerging Property of Assembling Weak Models

- Computer Science, ArXiv
- 2022

This work shows that the linearity of wide neural networks is, in fact, an emerging property of assembling a large number of diverse “weak” sub-models, none of which dominate the assembly.

Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation

- Computer Science, Acta Numerica
- 2021

Just as a physical prism separates colours mixed within a ray of light, the figurative prism of interpolation helps to disentangle generalization and optimization properties within the complex picture of modern machine learning.

A geometrical viewpoint on the benign overfitting property of the minimum $l_2$-norm interpolant estimator

- Computer Science
- 2022

The Dvoretzky dimension appearing naturally in the authors' geometrical viewpoint coincides with the effective rank from [1, 39] and is the key tool to handle the behavior of the design matrix restricted to the subspace $V_{k+1:p}$ where overfitting happens.

Demystifying the Neural Tangent Kernel from a Practical Perspective: Can it be trusted for Neural Architecture Search without training?

- Computer Science, ArXiv
- 2022

In Neural Architecture Search (NAS), reducing the cost of architecture evaluation remains one of the most crucial challenges. Among a plethora of efforts to bypass training of each candidate…

Embedded Ensembles: Infinite Width Limit and Operating Regimes

- Computer Science, AISTATS
- 2022

This paper uses a Neural-Tangent-Kernel-based approach to derive the wide network limit of the gradient descent dynamics and proves that in the independent regime the embedded ensemble behaves as an ensemble of independent models.

## References

Showing 1–10 of 23 references.

Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning

- Computer Science, ArXiv
- 2020

This work shows that optimization problems corresponding to over-parameterized systems of non-linear equations are not convex, even locally, but instead satisfy the Polyak-Łojasiewicz (PL) condition, allowing for efficient optimization by gradient descent or SGD.

On Lazy Training in Differentiable Programming

- Computer Science, NeurIPS
- 2019

This work shows that this "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels.

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

- Computer Science, NeurIPS
- 2019

This work shows that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.
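The first-order Taylor expansion mentioned above can be written out concretely. The following sketch is a toy illustration with hypothetical names, not the paper's setup: it compares a small scalar model with its linearization around an initialization $\theta_0$, $f_{\mathrm{lin}}(x; \theta) = f(x; \theta_0) + \nabla_\theta f(x; \theta_0)\cdot(\theta - \theta_0)$.

```python
import numpy as np

def f(x, theta):
    # toy scalar model: a tanh feature followed by a linear readout
    w, v = theta
    return v * np.tanh(w * x)

def grad_f(x, theta):
    w, v = theta
    return np.array([v * (1 - np.tanh(w * x) ** 2) * x,  # df/dw
                     np.tanh(w * x)])                    # df/dv

def f_lin(x, theta, theta0):
    # first-order Taylor expansion of f around theta0
    return f(x, theta0) + grad_f(x, theta0) @ (np.asarray(theta) - np.asarray(theta0))

theta0 = np.array([0.5, -1.0])
theta = theta0 + 1e-3 * np.array([1.0, 2.0])  # small parameter displacement
x = 0.7
err = abs(f(x, theta) - f_lin(x, theta, theta0))
print(err)  # second-order small in the displacement
```

In the infinite-width limit studied in that paper, the displacement of the parameters during training stays in the regime where this linearization error is negligible, so the training dynamics are governed by the linear model.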

Gradient descent optimizes over-parameterized deep ReLU networks

- Computer Science, Machine Learning
- 2019

The key idea of the proof is that Gaussian random initialization followed by gradient descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of gradient descent.

Linearized two-layers neural networks in high dimension

- Computer Science, ArXiv
- 2019

It is proved that, if both the dimension $d$ and the number of neurons $N$ are large, the behavior of these models is instead remarkably simple, and an equally simple bound on the generalization error of kernel ridge regression is obtained.

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

- Computer Science, ICLR
- 2019

Over-parameterization and random initialization jointly restrict every weight vector to remain close to its initialization throughout all iterations, which yields a strong-convexity-like property showing that gradient descent converges at a global linear rate to the global optimum.

Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks

- Computer Science, ICML
- 2019

This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: a tighter characterization of training speed, an explanation for why training a neural net with random labels leads to slower training, and a data-dependent complexity measure.

Gradient Descent with Identity Initialization Efficiently Learns Positive-Definite Linear Transformations by Deep Residual Networks

- Computer Science, Neural Computation
- 2019

It is shown that if the least-squares matrix Φ is symmetric and has a negative eigenvalue, then all members of a class of algorithms that perform gradient descent with identity initialization, and optionally regularize toward the identity in each layer, fail to converge.

A Convergence Theory for Deep Learning via Over-Parameterization

- Computer Science, ICML
- 2019

This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in *polynomial time*, and implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.

Adaptive estimation of a quadratic functional by model selection

- Mathematics
- 2000

We consider the problem of estimating $\|s\|^2$ when $s$ belongs to some separable Hilbert space $H$ and one observes the Gaussian process $Y(t) = \langle s, t \rangle + \sigma L(t)$, for all $t \in H$, where $L$ is some Gaussian…