• Corpus ID: 222125313

On the linearity of large non-linear models: when and why the tangent kernel is constant

@article{Liu2020OnTL,
  title={On the linearity of large non-linear models: when and why the tangent kernel is constant},
  author={Chaoyue Liu and Libin Zhu and Mikhail Belkin},
  journal={ArXiv},
  year={2020},
  volume={abs/2010.01092}
}
The goal of this work is to shed light on the remarkable phenomenon of transition to linearity of certain neural networks as their width approaches infinity. We show that the transition to linearity of the model and, equivalently, constancy of the (neural) tangent kernel (NTK) result from the scaling properties of the norm of the Hessian matrix of the network as a function of the network width. We present a general framework for understanding the constancy of the tangent kernel via Hessian… 
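To make the abstract's argument concrete, the following display recalls the standard quantities it appeals to; the notation (model $f(w;x)$, parameters $w$, initialization $w_0$, width $m$) is ours rather than quoted from the paper:

$K(x, x') = \nabla_w f(w; x)^\top \nabla_w f(w; x')$ (the tangent kernel),

$f(w; x) = f(w_0; x) + \nabla_w f(w_0; x)^\top (w - w_0) + \tfrac{1}{2}\,(w - w_0)^\top H(\xi; x)\,(w - w_0)$,

where $H(\xi; x)$ is the Hessian of the network output with respect to the parameters at an intermediate point $\xi$. If the spectral norm $\|H\|$ is uniformly small in a ball around $w_0$ containing the optimization path, the quadratic remainder is negligible, the model is effectively linear in $w$, and hence $\nabla_w f$, and with it the tangent kernel $K$, stays nearly constant during training. The paper's framework ties $\|H\|$ to the width, with the norm vanishing (up to logarithmic factors, roughly like $1/\sqrt{m}$ for suitably scaled architectures) as $m \to \infty$.

A minimal numerical sketch of the same phenomenon, not taken from the paper and with all names and hyperparameters chosen here purely for illustration: a one-hidden-layer tanh network with the $1/\sqrt{m}$ output scaling is trained for a few gradient-descent steps, and the relative change of its empirical tangent kernel is reported as the width grows.

import numpy as np

rng = np.random.default_rng(0)

def forward(W, v, X, m):
    # f(w; x) = v^T tanh(W x) / sqrt(m)  (NTK-style output scaling)
    return np.tanh(X @ W.T) @ v / np.sqrt(m)

def grads(W, v, X, m):
    # Rows are per-example gradients of the scalar output w.r.t. (W, v), flattened.
    A = np.tanh(X @ W.T)                                   # activations, shape (n, m)
    dW = ((1.0 - A**2) * v)[:, :, None] * X[:, None, :]    # df/dW, shape (n, m, d)
    return np.concatenate([dW.reshape(len(X), -1), A], axis=1) / np.sqrt(m)

def tangent_kernel(W, v, X, m):
    G = grads(W, v, X, m)
    return G @ G.T                                         # empirical NTK, shape (n, n)

d, n, lr, steps = 5, 20, 0.1, 50
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

for m in (10, 100, 1000, 10000):
    W, v = rng.normal(size=(m, d)), rng.normal(size=m)
    K0 = tangent_kernel(W, v, X, m)
    for _ in range(steps):                                 # plain gradient descent on MSE
        residual = forward(W, v, X, m) - y
        g = grads(W, v, X, m).T @ residual / n             # gradient of the loss
        W -= lr * g[: m * d].reshape(m, d)
        v -= lr * g[m * d :]
    K1 = tangent_kernel(W, v, X, m)
    drift = np.linalg.norm(K1 - K0) / np.linalg.norm(K0)
    print(f"width {m:>6}: relative change of tangent kernel = {drift:.4f}")

Under this parameterization the printed relative change should shrink as the width grows, mirroring the constancy of the tangent kernel described above.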

Citations

A Neural Tangent Kernel Perspective of GANs
TLDR
A novel theoretical framework for analyzing Generative Adversarial Networks (GANs) is proposed, leveraging the theory of infinite-width neural networks for the discriminator via its Neural Tangent Kernel to characterize the trained discriminator for a wide range of losses and establish general differentiability properties of the network.
Kernel Methods and Multi-layer Perceptrons Learn Linear Models in High Dimensions
TLDR
This work shows that for a large class of kernels, including the neural tangent kernel of fully connected networks, kernel methods can perform only as well as linear models in this high-dimensional regime, and suggests that data models richer than independent features are needed for high-dimensional analysis.
Lecture 5: NTK Origin and Derivation
TLDR
Conditions are established under which training the last layer of an infinitely wide, one-hidden-layer neural network is equivalent to solving kernel regression with the Neural Network Gaussian Process (NNGP) kernel.
On the Equivalence between Neural Network and Support Vector Machine
TLDR
The theory enables three practical applications: (i) a non-vacuous generalization bound for the NN via the corresponding KM; (ii) a nontrivial robustness certificate for the infinite-width NN (while existing robustness verification methods would provide vacuous bounds); and (iii) intrinsically more robust infinite-width NNs than those obtained from previous kernel regression.
Transition to Linearity of General Neural Networks with Directed Acyclic Graph Architecture
In this paper we show that feedforward neural networks corresponding to arbitrary directed acyclic graphs undergo transition to linearity as their “width” approaches infinity. The width of these…
Transition to Linearity of Wide Neural Networks is an Emerging Property of Assembling Weak Models
TLDR
This work shows that the linearity of wide neural networks is, in fact, an emerging property of assembling a large number of diverse “weak” sub-models, none of which dominate the assembly.
Fit without fear: remarkable mathematical phenomena of deep learning through the prism of interpolation
TLDR
Just as a physical prism separates colours mixed within a ray of light, the figurative prism of interpolation helps to disentangle generalization and optimization properties within the complex picture of modern machine learning.
A geometrical viewpoint on the benign overfitting property of the minimum $l_2$-norm interpolant estimator
TLDR
The Dvoretzky dimension appearing naturally in the authors' geometrical viewpoint coincides with the effective rank from [1, 39] and is the key tool for handling the behavior of the design matrix restricted to the subspace $V_{k+1:p}$ where overfitting happens.
Demystifying the Neural Tangent Kernel from a Practical Perspective: Can it be trusted for Neural Architecture Search without training?
In Neural Architecture Search (NAS), reducing the cost of architecture evaluation remains one of the most crucial challenges. Among a plethora of efforts to bypass training of each candidate…
Embedded Ensembles: Infinite Width Limit and Operating Regimes
TLDR
This paper uses a Neural-Tangent-Kernel-based approach to derive the wide network limit of the gradient descent dynamics and proves that in the independent regime the embedded ensemble behaves as an ensemble of independent models.

References

SHOWING 1-10 OF 23 REFERENCES
Toward a theory of optimization for over-parameterized systems of non-linear equations: the lessons of deep learning
TLDR
This work shows that optimization problems corresponding to over-parameterized systems of non-linear equations are not convex, even locally, but instead satisfy the Polyak-Łojasiewicz (PL) condition, allowing for efficient optimization by gradient descent or SGD.
On Lazy Training in Differentiable Programming
TLDR
This work shows that this "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels.
Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
TLDR
This work shows that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.
Gradient descent optimizes over-parameterized deep ReLU networks
TLDR
The key idea of the proof is that Gaussian random initialization followed by gradient descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of gradient descent.
Linearized two-layers neural networks in high dimension
TLDR
It is proved that, if both $d$ and $N$ are large, the behavior of these models is instead remarkably simpler, and an equally simple bound on the generalization error of Kernel Ridge Regression is obtained.
Gradient Descent Provably Optimizes Over-parameterized Neural Networks
TLDR
Over-parameterization and random initialization jointly restrict every weight vector to remain close to its initialization for all iterations, which allows a strong convexity-like property to be used to show that gradient descent converges at a global linear rate to the global optimum.
Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks
TLDR
This paper analyzes training and generalization for a simple two-layer ReLU network with random initialization, and provides the following improvements over recent works: a tighter characterization of training speed, an explanation for why training a neural network with random labels leads to slower training, and a data-dependent complexity measure.
Gradient Descent with Identity Initialization Efficiently Learns Positive-Definite Linear Transformations by Deep Residual Networks
TLDR
It is shown that if the least-squares matrix Φ is symmetric and has a negative eigenvalue, then all members of a class of algorithms that perform gradient descent with identity initialization, and optionally regularize toward the identity in each layer, fail to converge.
A Convergence Theory for Deep Learning via Over-Parameterization
TLDR
This work proves why stochastic gradient descent can find global minima on the training objective of DNNs in $\textit{polynomial time}$, and implies an equivalence between over-parameterized neural networks and the neural tangent kernel (NTK) in the finite (and polynomial) width setting.
Adaptive estimation of a quadratic functional by model selection
We consider the problem of estimating $\|s\|^2$ when $s$ belongs to some separable Hilbert space and one observes the Gaussian process $Y(t) = \langle s, t\rangle + \sigma L(t)$, for all $t \in H$, where $L$ is some Gaussian…