Corpus ID: 235422672

What can linearized neural networks actually say about generalization?

@inproceedings{OrtizJimnez2021WhatCL,
  title={What can linearized neural networks actually say about generalization?},
  author={Guillermo Ortiz-Jim{\'e}nez and Seyed-Mohsen Moosavi-Dezfooli and Pascal Frossard},
  booktitle={Neural Information Processing Systems},
  year={2021}
}
For certain infinitely-wide neural networks, the neural tangent kernel (NTK) theory fully characterizes generalization, but for the networks used in practice, the empirical NTK only provides a rough first-order approximation. Still, a growing body of work keeps leveraging this approximation to successfully analyze important deep learning phenomena and design algorithms for new applications. In our work, we provide strong empirical evidence to determine the practical validity of such… 
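
As a rough illustration of the object the abstract refers to, the sketch below computes the empirical NTK of a toy fully connected network as the Gram matrix of per-example parameter gradients, written in JAX. The architecture, shapes, and helper names (mlp, empirical_ntk) are illustrative assumptions, not code from the paper.

import jax
import jax.numpy as jnp
from jax.flatten_util import ravel_pytree

def mlp(params, x):
    # Toy fully connected net with one hidden layer and a scalar output
    # (an assumption made purely for illustration).
    w1, b1, w2, b2 = params
    h = jnp.tanh(x @ w1 + b1)
    return (h @ w2 + b2).squeeze()

def empirical_ntk(params, x1, x2):
    # K[i, j] = <grad_params f(x1_i), grad_params f(x2_j)>: the Gram matrix of
    # per-example parameter gradients evaluated at the current parameters.
    flat_grad = lambda x: ravel_pytree(jax.grad(mlp)(params, x))[0]
    j1 = jax.vmap(flat_grad)(x1)          # (n1, n_params)
    j2 = jax.vmap(flat_grad)(x2)          # (n2, n_params)
    return j1 @ j2.T                      # (n1, n2)

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
params = (jax.random.normal(k1, (3, 16)), jnp.zeros(16),
          jax.random.normal(k2, (16, 1)), jnp.zeros(1))
x = jax.random.normal(k3, (5, 3))
print(empirical_ntk(params, x, x).shape)  # (5, 5)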

Limitations of the NTK for Understanding Generalization in Deep Learning

This work studies NTKs through the lens of scaling laws, proves that they fall short of explaining important aspects of neural network generalization, and establishes concrete limitations of the NTK approach to understanding the generalization of real networks on natural datasets.

Feature learning and random features in standard finite-width convolutional neural networks: An empirical study

Feature learning appears to be important for non-wide standard networks but becomes less significant with increasing width, and cases where standard and linearized networks match in performance are identified, in agreement with NTK theory.

Learning sparse features can lead to overfitting in neural networks

It is shown that feature learning can perform worse than lazy training (via a random-feature kernel or the NTK), since the former can lead to a sparser neural representation, and it is empirically demonstrated that learned features can indeed yield sparse, and thereby less smooth, representations of image predictors.

Neural Tangent Kernel Eigenvalues Accurately Predict Generalization

Though the theory is derived for infinite-width architectures, it is found to agree with networks as narrow as width 20, suggesting it is predictive of generalization in practical neural networks.

Quadratic models for understanding neural network dynamics

It is shown that the extra quadratic term in NQMs allows for catapult convergence, where the loss increases in the early stage of training and then converges afterwards, and that the top eigenvalues of the tangent kernel typically decrease after the catapult phase, while they remain nearly constant when training with sub-critical learning rates, for which the loss converges monotonically.

What Can the Neural Tangent Kernel Tell Us About Adversarial Robustness?

It is shown how NTKs can be used to generate adversarial examples in a “training-free” fashion, it is demonstrated that these examples transfer to fool their neural-net counterparts in the “lazy” regime, and light is shed on the robustness mechanism underlying adversarial training of neural nets used in practice.

Spectral evolution and invariance in linear-width neural networks

The results show that monitoring the evolution of the spectra during training is an important step toward understanding the training dynamics and feature learning.

Lazy vs hasty: linearization in deep networks impacts learning schedule based on example difficulty

It is shown that easier examples are given more weight in the feature learning mode, resulting in faster training on them than on more difficult ones, and a new understanding of how deep networks prioritize resources across example difficulty is revealed.

Demystifying the Neural Tangent Kernel from a Practical Perspective: Can it be trusted for Neural Architecture Search without training?

In Neural Architecture Search (NAS), reducing the cost of architecture evaluation remains one of the most crucial challenges. Among a plethora of efforts to bypass training of each candidate…

A Structured Dictionary Perspective on Implicit Neural Representations

It is shown that most INR families are analogous to structured signal dictionaries whose atoms are integer harmonics of the set of initial mapping frequencies, which allows INRs to express signals with an exponentially increasing frequency support using a number of parameters that only grows linearly with depth.

References

Showing 1–10 of 45 references

When do neural networks outperform kernel methods?

It is shown that the curse of dimensionality suffered by kernel methods becomes milder if the covariates display the same low-dimensional structure as the target function, and a spiked-covariates model is presented that captures, in a unified framework, both behaviors observed in earlier work.

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

This work shows that for wide NNs the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.
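
A minimal sketch of the first-order Taylor (linearized) model described above, written in JAX; the one-parameter toy model and the linearize helper are assumptions for illustration, not the authors' code. Training f_lin by gradient descent on its parameters is the linear model whose dynamics, in the infinite-width limit, govern those of the original network.

import jax
import jax.numpy as jnp

def linearize(f, params0):
    # f_lin(params, x) = f(params0, x) + J_params f(params0, x) . (params - params0)
    def f_lin(params, x):
        dp = jax.tree_util.tree_map(lambda p, p0: p - p0, params, params0)
        y0, jy = jax.jvp(lambda p: f(p, x), (params0,), (dp,))
        return y0 + jy
    return f_lin

# Toy check: a one-parameter nonlinear model and its linearization around w0.
f = lambda w, x: jnp.tanh(w * x)
w0 = jnp.array(0.5)
f_lin = linearize(f, w0)
print(f(jnp.array(0.6), 1.0), f_lin(jnp.array(0.6), 1.0))  # close for small parameter changes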

What Can ResNet Learn Efficiently, Going Beyond Kernels?

It is proved that neural networks can efficiently learn a notable class of functions, including those defined by three-layer residual networks with smooth activations, without any distributional assumption, and a computational complexity advantage of ResNet over other learning methods, including linear regression over arbitrary feature mappings, is also established.

Uniform convergence may be unable to explain generalization in deep learning

Through numerous experiments, doubt is cast on the power of uniform convergence-based generalization bounds to provide a complete picture of why overparameterized deep networks generalize well.

Understanding deep learning requires rethinking generalization

These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.

Harnessing the Power of Infinitely Wide Deep Nets on Small-data Tasks

Results are reported suggesting that neural tangent kernels perform strongly on low-data tasks; when the performance of the NTK is compared with that of the finite-width net it was derived from, NTK behavior is found to start at lower net widths than suggested by theoretical analysis.

On Exact Computation with an Infinitely Wide Neural Net

The current paper gives the first efficient exact algorithm for computing the extension of the NTK to convolutional neural nets, which is called the Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm.

Fast Adaptation with Linearized Neural Networks

This work proposes a technique for embedding the inductive biases of linearized neural networks into Gaussian processes through a kernel designed from the Jacobian of the network, and develops significant computational speed-ups based on matrix multiplies, including a novel implementation for scalable Fisher-vector products.
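
The matrix-multiply speed-ups mentioned here presumably rely on never materializing the network Jacobian; as one standard matrix-free pattern (a sketch under that assumption, not the paper's implementation), a Jacobian-based kernel can be applied to a vector with a single VJP followed by a single JVP. The toy net and the ntk_vec_product name are illustrative assumptions.

import jax
import jax.numpy as jnp

def net(params, x):
    # Toy two-layer net with a scalar output per example (an assumption for illustration).
    w1, w2 = params
    return jnp.tanh(x @ w1) @ w2                # shape (n,)

def ntk_vec_product(params, x, v):
    # Computes K v = J (J^T v) without forming the (n, n_params) Jacobian J,
    # where J[i, :] = d net(params, x_i) / d params and K = J J^T.
    f = lambda p: net(p, x)
    _, vjp = jax.vjp(f, params)                 # pullback: output space -> parameter space
    (jt_v,) = vjp(v)                            # J^T v, a pytree shaped like params
    _, k_v = jax.jvp(f, (params,), (jt_v,))     # pushforward: parameter space -> output space
    return k_v                                  # shape (n,)

key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
params = (jax.random.normal(k1, (3, 16)), jax.random.normal(k2, (16,)))
x = jax.random.normal(k3, (5, 3))
v = jax.random.normal(k4, (5,))
print(ntk_vec_product(params, x, v).shape)      # (5,)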

What Do Neural Networks Learn When Trained With Random Labels?

It is shown analytically, for convolutional and fully connected networks, that an alignment between the principal components of network parameters and data takes place when training with random labels, and it is explained how this alignment can produce a positive transfer.

Neural Spectrum Alignment: Empirical Study

This paper empirically explores properties of the NTK along the optimization trajectory and shows that in practical applications the NTK changes in a dramatic and meaningful way, with its top eigenfunctions aligning toward the target function learned by the NN.