• Corpus ID: 222125313

# On the linearity of large non-linear models: when and why the tangent kernel is constant

@article{Liu2020OnTL,
  title={On the linearity of large non-linear models: when and why the tangent kernel is constant},
  author={Chaoyue Liu and Libin Zhu and Mikhail Belkin},
  journal={ArXiv},
  year={2020},
  volume={abs/2010.01092}
}
• Published 2 October 2020
• Computer Science
• ArXiv
The goal of this work is to shed light on the remarkable phenomenon of transition to linearity of certain neural networks as their width approaches infinity. We show that the transition to linearity of the model and, equivalently, constancy of the (neural) tangent kernel (NTK) result from the scaling properties of the norm of the Hessian matrix of the network as a function of the network width. We present a general framework for understanding the constancy of the tangent kernel via Hessian…
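As a toy illustration of the kernel-constancy claim (a sketch under stated assumptions, not the paper's construction), the snippet below builds a one-hidden-layer tanh network in the NTK parameterization and measures how much its empirical tangent kernel moves when the parameters are displaced by a vector of fixed Euclidean norm. The drift shrinks as the width `m` grows, mirroring the Hessian-norm scaling the abstract describes. The `ntk` and `kernel_drift` helpers, the input pairs, and the deterministic stand-in for a random initialization are all illustrative inventions.

```python
import math

# Toy 1-hidden-layer network in the NTK parameterization:
#   f(x; w, v) = (1/sqrt(m)) * sum_i v_i * tanh(w_i * x)
# Empirical tangent kernel: K(x, x') = <grad_theta f(x), grad_theta f(x')>.

def ntk(ws, vs, x1, x2):
    m = len(ws)
    k = 0.0
    for w, v in zip(ws, vs):
        t1, t2 = math.tanh(w * x1), math.tanh(w * x2)
        s1, s2 = 1.0 - t1 * t1, 1.0 - t2 * t2  # sech^2 = 1 - tanh^2
        # df/dv_i = tanh(w x)/sqrt(m);  df/dw_i = v x sech^2(w x)/sqrt(m)
        k += (t1 * t2 + v * v * x1 * x2 * s1 * s2) / m
    return k

def kernel_drift(m, step=1.0):
    # deterministic stand-in for a random initialization (illustration only)
    ws = [math.sin(i + 1.0) for i in range(m)]
    vs = [math.cos(2.0 * i + 1.0) for i in range(m)]
    # displace all 2m parameters by a vector of Euclidean norm `step`
    d = step / math.sqrt(2.0 * m)
    ws2 = [w + d for w in ws]
    vs2 = [v + d for v in vs]
    pairs = [(0.5, -1.0), (1.0, 1.0), (0.3, 0.7)]
    return max(abs(ntk(ws2, vs2, a, b) - ntk(ws, vs, a, b)) for a, b in pairs)

for m in (8, 128, 2048):
    print(f"width={m:5d}  kernel drift={kernel_drift(m):.6f}")
```

For a fixed-norm parameter move the drift decays roughly like 1/sqrt(m), which is the sense in which the tangent kernel becomes constant in the infinite-width limit.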

## Citations of this paper

A Neural Tangent Kernel Perspective of GANs
• Computer Science
• ArXiv
• 2021
A novel theoretical framework of analysis for Generative Adversarial Networks (GANs) is proposed, leveraging the theory of infinite-width neural networks for the discriminator via its Neural Tangent Kernel to characterize the trained discriminator for a wide range of losses and to establish general differentiability properties of the network.
Kernel Methods and Multi-layer Perceptrons Learn Linear Models in High Dimensions
• Computer Science
• 2022
This work shows that for a large class of kernels, including the neural tangent kernel of fully connected networks, kernel methods can only perform as well as linear models in this regime, and suggests that more complex models for the data other than independent features are needed for high-dimensional analysis.
Lecture 5: NTK Origin and Derivation
• Computer Science
• 2022
Conditions under which training the last layer of an infinitely wide, 1-hidden layer neural network is equivalent to solving kernel regression with the Neural Network Gaussian Process (NNGP) are established.
On the Equivalence between Neural Network and Support Vector Machine
• Computer Science
• NeurIPS
• 2021
The theory enables three practical applications: (i) a non-vacuous generalization bound for the NN via the corresponding kernel machine (KM); (ii) a nontrivial robustness certificate for the infinite-width NN (where existing robustness verification methods would provide vacuous bounds); and (iii) intrinsically more robust infinite-width NNs than those obtained from previous kernel regression.
Transition to Linearity of General Neural Networks with Directed Acyclic Graph Architecture
• Computer Science
• 2022
In this paper we show that feedforward neural networks corresponding to arbitrary directed acyclic graphs undergo a transition to linearity as their “width” approaches infinity. The width of these…
Transition to Linearity of Wide Neural Networks is an Emerging Property of Assembling Weak Models
• Computer Science
• ArXiv
• 2022
This work shows that the linearity of wide neural networks is, in fact, an emerging property of assembling a large number of diverse “weak” sub-models, none of which dominate the assembly.
Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation
Just as a physical prism separates colours mixed within a ray of light, the figurative prism of interpolation helps to disentangle generalization and optimization properties within the complex picture of modern machine learning.
A geometrical viewpoint on the benign overfitting property of the minimum $l_2$-norm interpolant estimator
• Computer Science
• 2022
The Dvoretsky dimension appearing naturally in the authors' geometrical viewpoint coincides with the effective rank from [1, 39] and is the key tool to handle the behavior of the design matrix restricted to the subspace $V_{k+1:p}$ where overfitting happens.
Demystifying the Neural Tangent Kernel from a Practical Perspective: Can it be trusted for Neural Architecture Search without training?
• Computer Science
• ArXiv
• 2022
In Neural Architecture Search (NAS), reducing the cost of architecture evaluation remains one of the most crucial challenges. Among a plethora of efforts to bypass training of each candidate…
Embedded Ensembles: Infinite Width Limit and Operating Regimes
• Computer Science
• AISTATS
• 2022
This paper uses a Neural-Tangent-Kernel-based approach to derive the wide network limit of the gradient descent dynamics and proves that in the independent regime the embedded ensemble behaves as an ensemble of independent models.

## References

Showing 1–10 of 23 references
Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning
• Computer Science
• ArXiv
• 2020
This work shows that optimization problems corresponding to over-parameterized systems of non-linear equations are not convex, even locally, but instead satisfy the Polyak-Łojasiewicz (PL) condition allowing for efficient optimization by gradient descent or SGD.
On Lazy Training in Differentiable Programming
• Computer Science
• NeurIPS
• 2019
This work shows that the "lazy training" phenomenon is not specific to over-parameterized neural networks, but is due to a choice of scaling that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels.
Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
• Computer Science
• NeurIPS
• 2019
This work shows that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.
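The linearized model referred to here is the first-order Taylor expansion $f_{\mathrm{lin}}(x;\theta) = f(x;\theta_0) + \nabla_\theta f(x;\theta_0)\cdot(\theta-\theta_0)$. A minimal pure-Python sketch (toy tanh network with a deterministic stand-in initialization, invented for illustration and not this paper's setup) shows that the gap between the network and its linearization, for a parameter move of fixed norm, shrinks as the width grows:

```python
import math

# f(x; w, v) = (1/sqrt(m)) * sum_i v_i * tanh(w_i * x), and its linearization
# around (ws0, vs0): f_lin = f(theta0) + grad_theta f(theta0) . (theta - theta0)

def f(ws, vs, x):
    m = len(ws)
    return sum(v * math.tanh(w * x) for w, v in zip(ws, vs)) / math.sqrt(m)

def f_lin(ws0, vs0, ws, vs, x):
    m = len(ws0)
    out = f(ws0, vs0, x)
    for w0, v0, w, v in zip(ws0, vs0, ws, vs):
        t = math.tanh(w0 * x)
        s = 1.0 - t * t  # sech^2 = 1 - tanh^2
        # df/dv_i = tanh(w0 x)/sqrt(m);  df/dw_i = v0 x sech^2(w0 x)/sqrt(m)
        out += (t * (v - v0) + v0 * x * s * (w - w0)) / math.sqrt(m)
    return out

def lin_error(m, step=1.0, x=0.8):
    ws0 = [math.sin(i + 1.0) for i in range(m)]
    vs0 = [math.cos(2.0 * i + 1.0) for i in range(m)]
    d = step / math.sqrt(2.0 * m)  # parameter move of Euclidean norm `step`
    ws = [w + d for w in ws0]
    vs = [v + d for v in vs0]
    return abs(f(ws, vs, x) - f_lin(ws0, vs0, ws, vs, x))

for m in (8, 128, 2048):
    print(f"width={m:5d}  linearization error={lin_error(m):.6f}")
```

The error is second-order in the parameter move and scales down with width, which is why, in the infinite-width limit, gradient descent on the network and on its linear model coincide.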
Gradient descent optimizes over-parameterized deep ReLU networks
• Computer Science
• Machine Learning
• 2019
The key idea of the proof is that Gaussian random initialization followed by gradient descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of gradient descent.
Linearized two-layers neural networks in high dimension
• Computer Science
• ArXiv
• 2019
It is proved that, if both the input dimension $d$ and the number of neurons $N$ are large, the behavior of these models is instead remarkably simpler, and an equally simple bound on the generalization error of kernel ridge regression is obtained.
Gradient Descent Provably Optimizes Over-parameterized Neural Networks
• Computer Science
• ICLR
• 2019
Over-parameterization and random initialization jointly restrict every weight vector to be close to its initialization for all iterations, which allows a strong convexity-like property to show that gradient descent converges at a global linear rate to the global optimum.
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
• Computer Science
• ICML
• 2019
This paper analyzes training and generalization for a simple 2-layer ReLU net with random initialization, and provides the following improvements over recent works: a tighter characterization of training speed, an explanation for why training a neural net with random labels leads to slower training, and a data-dependent complexity measure.
Gradient Descent with Identity Initialization Efficiently Learns Positive-Definite Linear Transformations by Deep Residual Networks
• Computer Science
• Neural Computation
• 2019
It is shown that if the least-squares matrix Φ is symmetric and has a negative eigenvalue, then all members of a class of algorithms that perform gradient descent with identity initialization, and optionally regularize toward the identity in each layer, fail to converge.
A Convergence Theory for Deep Learning via Over-Parameterization
• Computer Science
• ICML
• 2019
This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in *polynomial time*, and implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.