Corpus ID: 235422672

What can linearized neural networks actually say about generalization?

@inproceedings{OrtizJimnez2021WhatCL,
  title={What can linearized neural networks actually say about generalization?},
  author={Guillermo Ortiz-Jim{\'e}nez and Seyed-Mohsen Moosavi-Dezfooli and Pascal Frossard},
  booktitle={NeurIPS},
  year={2021}
}
For certain infinitely-wide neural networks, the neural tangent kernel (NTK) theory fully characterizes generalization, but for the networks used in practice, the empirical NTK only provides a rough first-order approximation. Still, a growing body of work keeps leveraging this approximation to successfully analyze important deep learning phenomena and design algorithms for new applications. In our work, we provide strong empirical evidence to determine the practical validity of such… 
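
For concreteness, the empirical NTK the abstract refers to is the Gram matrix of per-example parameter gradients, Θ(x, x') = ∇θ f(x; θ)ᵀ ∇θ f(x'; θ), evaluated at the network's current weights. The sketch below computes it for a toy two-layer network in JAX; the architecture, shapes, and random data are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch (illustrative, not the paper's code): the empirical NTK of a
# toy two-layer network, computed as the Gram matrix of per-example parameter
# gradients, Theta(x, x') = <grad_theta f(x; theta), grad_theta f(x'; theta)>.
import jax
import jax.numpy as jnp

def f(params, x):
    # Scalar-output two-layer ReLU network, f(x; theta).
    return params["W2"] @ jax.nn.relu(params["W1"] @ x)

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
params = {
    "W1": jax.random.normal(k1, (16, 4)) / jnp.sqrt(4.0),
    "W2": jax.random.normal(k2, (16,)) / jnp.sqrt(16.0),
}
X = jax.random.normal(k3, (8, 4))                                # 8 toy inputs of dimension 4

def empirical_ntk(params, X):
    grads = jax.vmap(lambda x: jax.grad(f)(params, x))(X)        # per-example gradient pytrees
    flat = jnp.concatenate(
        [g.reshape(X.shape[0], -1) for g in jax.tree_util.tree_leaves(grads)], axis=1
    )                                                            # (n, num_params)
    return flat @ flat.T                                         # (n, n) empirical NTK

print(empirical_ntk(params, X).shape)                            # (8, 8)
```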

Limitations of the NTK for Understanding Generalization in Deep Learning

TLDR
This work studies NTKs through the lens of scaling laws, proves that they fall short of explaining important aspects of neural network generalization, and establishes concrete limitations of the NTK approach for understanding the generalization of real networks on natural datasets.

Learning sparse features can lead to overfitting in neural networks

TLDR
It is shown that feature learning can perform worse than lazy training (via a random-feature kernel or the NTK) because the former can lead to a sparser neural representation, and it is empirically shown that learning features can indeed lead to sparse and thereby less smooth representations of image predictors.

Neural Tangent Kernel Eigenvalues Accurately Predict Generalization

TLDR
Though the theory is derived for infinite-width architectures, it is found to agree with networks as narrow as width 20, suggesting it is predictive of generalization in practical neural networks.

Demystifying the Neural Tangent Kernel from a Practical Perspective: Can it be trusted for Neural Architecture Search without training?

In Neural Architecture Search (NAS), reducing the cost of architecture evaluation remains one of the most crucial challenges. Among a plethora of efforts to bypass training of each candidate…

Lazy vs hasty: linearization in deep networks impacts learning schedule based on example difficulty

TLDR
It is shown that easier examples are given more weight in the feature-learning mode, resulting in faster training compared to more difficult ones, revealing a new understanding of how deep networks prioritize resources across example difficulty.

A Structured Dictionary Perspective on Implicit Neural Representations

TLDR
It is shown that most INR families are analogous to structured signal dictionaries whose atoms are integer harmonics of the set of initial mapping frequencies, which allows INRs to express signals with an exponentially increasing frequency support using a number of parameters that only grows linearly with depth.
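
One way to see where integer harmonics can arise is the Jacobi-Anger expansion: passing a sinusoidal feature of frequency ω through another sinusoid produces only integer multiples of ω, with Bessel-function coefficients. This identity is offered purely as illustration and is not necessarily the paper's exact derivation.

```latex
% Jacobi-Anger expansion: a sinusoid applied to a sinusoidal feature of
% frequency \omega contains only integer harmonics k\omega, with coefficients
% given by Bessel functions of the first kind J_k.
\sin\!\bigl(a\sin(\omega x)\bigr) = 2\sum_{k\ \text{odd}} J_k(a)\,\sin(k\omega x),
\qquad
\cos\!\bigl(a\sin(\omega x)\bigr) = J_0(a) + 2\sum_{\substack{k\ \text{even}\\ k\ge 2}} J_k(a)\,\cos(k\omega x).
```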

Can we achieve robustness from data alone?

TLDR
This work devises a meta-learning method for robust classification that optimizes the dataset prior to its deployment in a principled way, aiming to effectively remove the non-robust parts of the data.

Representation Alignment in Neural Networks

TLDR
It is demonstrated that representation alignment may play an important role in neural network representations, and a classic synthetic transfer problem is used to explain why alignment between the top singular vectors and the targets promotes transfer.

Understanding Feature Transfer Through Representation Alignment

TLDR
It is found that training neural networks with different architectures and optimizers on random or true labels enforces the same relationship between the hidden representations and the training labels, elucidating why neural network representations have been so successful for transfer.
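
As a rough illustration of the alignment measurements described in the two entries above, the sketch below checks how much of the label vector's energy falls on the top left-singular vectors of a hidden-representation matrix; the data is synthetic and the setup is an assumption, not the papers' experiments.

```python
# Minimal sketch (synthetic data): how much of the label vector's energy lies in
# the top left-singular directions of a hidden-representation matrix Phi
# (one row per training example). Strong concentration in the first few
# directions is the kind of alignment the entries above describe.
import jax
import jax.numpy as jnp

key = jax.random.PRNGKey(0)
k1, k2, k3 = jax.random.split(key, 3)
n, d = 128, 32
y = jnp.sign(jax.random.normal(k1, (n,)))                    # toy +/-1 labels
v = jax.random.normal(k2, (d,))
Phi = jnp.outer(y, v) + 0.5 * jax.random.normal(k3, (n, d))  # representations whose top direction tracks y

U, S, Vt = jnp.linalg.svd(Phi, full_matrices=False)          # U: (n, d) left singular vectors
energy = (U.T @ y) ** 2                                      # label energy per singular direction
alignment = jnp.cumsum(energy) / jnp.sum(y ** 2)             # fraction captured by the top-k directions
print(alignment[:3])                                         # most of the label energy falls on the first direction here
```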

Quadratic models for understanding neural network dynamics

TLDR
It is shown that the extra quadratic term in neural quadratic models (NQMs) allows for catapult convergence: the loss increases at an early stage and then converges afterwards. The top eigenvalues of the tangent kernel typically decrease after the catapult phase, while they remain nearly constant when training with sub-critical learning rates, where the loss converges monotonically.
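
For context, a neural quadratic model of this kind keeps the second-order term of the Taylor expansion of the network around its initialization θ₀, schematically (notation assumed here, not taken from the paper):

```latex
% Quadratic (second-order Taylor) model of the network around initialization
% \theta_0; the linearized/NTK model keeps only the first two terms.
f(x;\theta) \approx f(x;\theta_0)
  + \nabla_\theta f(x;\theta_0)^{\top}(\theta-\theta_0)
  + \tfrac{1}{2}(\theta-\theta_0)^{\top}\nabla_\theta^{2} f(x;\theta_0)\,(\theta-\theta_0).
```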

References

Showing 1–10 of 45 references

When do neural networks outperform kernel methods?

TLDR
It is shown that this curse of dimensionality becomes milder if the covariates display the same low-dimensional structure as the target function, and a spiked-covariates model is presented that can capture, in a unified framework, both behaviors observed in earlier work.

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

TLDR
This work shows that for wide NNs the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.
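
As a hedged illustration of the linear model described above, the sketch below builds f_lin(x; θ) = f(x; θ₀) + ∇θ f(x; θ₀)ᵀ(θ − θ₀) for a toy network using a Jacobian-vector product; the network and shapes are illustrative assumptions, not the paper's setup.

```python
# Minimal sketch (illustrative setup): a network linearized around its initial
# parameters, f_lin(x; theta) = f(x; theta0) + grad_theta f(x; theta0)^T (theta - theta0),
# evaluated with a Jacobian-vector product so the Jacobian is never formed explicitly.
import jax
import jax.numpy as jnp

def f(params, x):
    return params["W2"] @ jnp.tanh(params["W1"] @ x)   # toy scalar-output network

def linearize(f, params0):
    def f_lin(params, x):
        delta = jax.tree_util.tree_map(lambda p, p0: p - p0, params, params0)
        y0, tangent = jax.jvp(lambda p: f(p, x), (params0,), (delta,))
        return y0 + tangent
    return f_lin

key = jax.random.PRNGKey(1)
k1, k2 = jax.random.split(key)
params0 = {"W1": jax.random.normal(k1, (8, 3)), "W2": jax.random.normal(k2, (8,))}
f_lin = linearize(f, params0)

x = jnp.ones(3)
print(f(params0, x), f_lin(params0, x))   # equal at theta0 by construction
```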

What Can ResNet Learn Efficiently, Going Beyond Kernels?

TLDR
It is proved that neural networks can efficiently learn a notable class of functions, including those defined by three-layer residual networks with smooth activations, without any distributional assumption; a computational complexity advantage of ResNet over other learning methods, including linear regression over arbitrary feature mappings, is also established.

Uniform convergence may be unable to explain generalization in deep learning

TLDR
Through numerous experiments, doubt is cast on the power of uniform convergence-based generalization bounds to provide a complete picture of why overparameterized deep networks generalize well.

Understanding deep learning requires rethinking generalization

TLDR
These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth two neural networks already have perfect finite sample expressivity.

On Exact Computation with an Infinitely Wide Neural Net

TLDR
The current paper gives the first efficient exact algorithm for computing the extension of the NTK to convolutional neural nets, which it calls the Convolutional NTK (CNTK), as well as an efficient GPU implementation of this algorithm.

Fast Adaptation with Linearized Neural Networks

TLDR
This work proposes a technique for embedding the inductive biases of linearized neural networks into Gaussian processes through a kernel designed from the Jacobian of the network, and develops significant computational speed-ups based on matrix multiplies, including a novel implementation of scalable Fisher vector products.
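
The "matrix multiplies" mentioned above can be illustrated with the standard JVP/VJP trick for Fisher (Gauss-Newton) vector products, which avoids materializing the Jacobian; this is a generic sketch under toy assumptions, not the paper's implementation.

```python
# Generic sketch (toy setup, not the paper's implementation): Fisher/Gauss-Newton
# vector products F v = (1/n) J^T (J v) computed with one JVP and one VJP, so the
# (n x num_params) Jacobian J of the network outputs is never materialized.
import jax
import jax.numpy as jnp

def f(params, x):
    return params["W2"] @ jnp.tanh(params["W1"] @ x)     # toy scalar-output network

key = jax.random.PRNGKey(2)
k1, k2, k3 = jax.random.split(key, 3)
params = {"W1": jax.random.normal(k1, (8, 3)), "W2": jax.random.normal(k2, (8,))}
X = jax.random.normal(k3, (16, 3))                       # 16 toy inputs

def batched_f(p):
    return jax.vmap(lambda x: f(p, x))(X)                # all network outputs, shape (16,)

def fisher_vp(params, v):
    _, Jv = jax.jvp(batched_f, (params,), (v,))          # J v, shape (16,)
    _, vjp_fn = jax.vjp(batched_f, params)
    (JtJv,) = vjp_fn(Jv)                                 # J^T (J v), a pytree like params
    return jax.tree_util.tree_map(lambda g: g / X.shape[0], JtJv)

v = jax.tree_util.tree_map(jnp.ones_like, params)        # probe direction in parameter space
print(jax.tree_util.tree_map(jnp.shape, fisher_vp(params, v)))
```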

What Do Neural Networks Learn When Trained With Random Labels?

TLDR
It is shown analytically, for convolutional and fully connected networks, that an alignment between the principal components of network parameters and the data takes place when training with random labels, and it is explained how this alignment produces a positive transfer.

Neural Spectrum Alignment: Empirical Study

TLDR
This paper empirically explores properties of the NTK along the optimization trajectory and shows that in practical applications the NTK changes in a dramatic and meaningful way, with its top eigenfunctions aligning toward the target function learned by the network.
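
A simple, hedged way to quantify the kind of alignment described in the last two entries is kernel-target alignment between a kernel matrix (for example the empirical NTK at some training step) and the labels; the sketch below uses a synthetic kernel purely for illustration.

```python
# Minimal sketch (synthetic kernel): kernel-target alignment A(K, y) between a
# kernel matrix K (e.g. the empirical NTK at some training step) and labels y.
# Values near 1 mean y lies mostly in K's top eigendirections.
import jax
import jax.numpy as jnp

def kernel_target_alignment(K, y):
    yy = jnp.outer(y, y)
    return jnp.sum(K * yy) / (jnp.linalg.norm(K) * jnp.linalg.norm(yy))

key = jax.random.PRNGKey(3)
y = jnp.sign(jax.random.normal(key, (64,)))
K_aligned = jnp.outer(y, y) + 0.1 * jnp.eye(64)   # kernel whose top eigenvector matches y
K_neutral = jnp.eye(64)                           # kernel with no preferred direction

print(kernel_target_alignment(K_aligned, y))      # close to 1
print(kernel_target_alignment(K_neutral, y))      # much smaller (about 1/8 here)
```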

On the Inductive Bias of Neural Tangent Kernels

TLDR
This work studies smoothness, approximation, and stability properties of functions with finite norm in the associated RKHS, including stability to image deformations in the case of convolutional networks, and compares the NTK to other known kernels for similar architectures.