# Neural Networks as Kernel Learners: The Silent Alignment Effect

```bibtex
@article{Atanasov2021NeuralNA,
  title   = {Neural Networks as Kernel Learners: The Silent Alignment Effect},
  author  = {Alexander Atanasov and Blake Bordelon and Cengiz Pehlevan},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2111.00034}
}
```

Neural networks in the lazy training regime converge to kernel machines. Can neural networks in the rich feature learning regime learn a kernel machine with a data-dependent kernel? We demonstrate that this can indeed happen due to a phenomenon we term silent alignment, which requires that the tangent kernel of a network evolves in eigenstructure while small and before the loss appreciably decreases, and grows only in overall scale afterwards. We show that such an effect takes place in…
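The mechanism described in the abstract can be illustrated numerically. Below is a minimal sketch (not the authors' code; the architecture, seed, and hyperparameters are all illustrative assumptions) that trains a tiny two-layer linear network from a small initialization, so it operates in the rich regime, and tracks two properties of the empirical tangent kernel: its Frobenius norm (overall scale) and its cosine alignment with the final kernel (a proxy for eigenstructure). In runs like this, alignment with the final kernel typically rises while the kernel norm and the loss are still small, after which the kernel mostly grows in scale: the silent-alignment signature.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny two-layer linear network f(x) = w2 @ (W1 @ x), trained from a
# small initialization so it sits in the rich (feature-learning) regime.
d, h, n = 5, 8, 10
X = rng.standard_normal((n, d))
beta = rng.standard_normal(d)
y = X @ beta                       # targets from a linear teacher

scale0 = 1e-2                      # small init -> rich regime
W1 = scale0 * rng.standard_normal((h, d))
w2 = scale0 * rng.standard_normal(h)

def ntk(W1, w2):
    """Empirical NTK: K[i, j] = grad f(x_i) . grad f(x_j) over all params."""
    H = X @ W1.T                   # hidden activations, n x h
    return H @ H.T + (w2 @ w2) * (X @ X.T)

def cosine(A, B):
    """Kernel alignment: Frobenius cosine similarity of two Gram matrices."""
    return np.sum(A * B) / (np.linalg.norm(A) * np.linalg.norm(B))

lr, steps = 0.02, 5000
kernels, losses = [], []
for t in range(steps):
    f = (X @ W1.T) @ w2            # network predictions
    r = f - y                      # residuals
    losses.append(0.5 * np.mean(r ** 2))
    kernels.append(ntk(W1, w2))
    # plain gradient descent on the MSE loss
    gW1 = np.outer(w2, r @ X) / n
    gw2 = (X @ W1.T).T @ r / n
    W1 -= lr * gW1
    w2 -= lr * gw2

K_final = kernels[-1]
# Silent alignment: the kernel's direction settles while its norm and the
# loss have barely moved; only the overall scale grows afterwards.
for t in [0, 100, 500, steps - 1]:
    print(f"step {t:4d}  loss {losses[t]:.4f}  "
          f"|K| {np.linalg.norm(kernels[t]):.4f}  "
          f"align {cosine(kernels[t], K_final):+.3f}")
```

Deep *linear* networks are used here only because their gradients and tangent kernel have closed forms in a few lines; the paper's claim concerns more general settings.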

## 2 Citations

Depth induces scale-averaging in overparameterized linear Bayesian neural networks

- Computer Science, Mathematics · ArXiv
- 2021

Finite deep linear Bayesian neural networks are interpreted as data-dependent scale mixtures of Gaussian process predictors across output channels, allowing limiting results obtained in previous studies to be connected within a unified framework.

Properties of the After Kernel

- Computer Science · ArXiv
- 2021

The "after kernel", defined by the same embedding as the neural tangent kernel but computed after training, is studied for neural networks with standard architectures on binary classification problems extracted from MNIST and CIFAR-10, trained with SGD in a standard way.

## References

Showing 1–10 of 49 references

Kernel and Rich Regimes in Overparametrized Models

- Computer Science, Mathematics · COLT
- 2020

This work shows how the scale of the initialization controls the transition between the "kernel" and "rich" regimes and affects generalization in multilayer homogeneous models, and highlights an interesting role for model width when the predictor is not identically zero at initialization.

Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks

- Medicine, Mathematics · Nature Communications
- 2021

This work investigates generalization error in kernel regression and proposes a predictive theory of generalization that applies to real data, explaining various generalization phenomena observed in wide neural networks, which admit a kernel limit and generalize well despite being overparameterized.

Rapid Feature Evolution Accelerates Learning in Neural Networks

- Computer Science, Mathematics · ArXiv
- 2021

It is shown that feature evolution is faster and more dramatic in deeper networks, and that networks with multiple output nodes develop separate, specialized kernels for each output channel, a phenomenon the authors term kernel specialization.

Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel

- Computer Science, Mathematics · NeurIPS
- 2020

A large-scale phenomenological analysis of training reveals a striking correlation between a diverse set of metrics over training time, governed by a rapid chaotic-to-stable transition in the first few epochs, posing both challenges and opportunities for the development of more accurate theories of deep learning.

Feature Learning in Infinite-Width Neural Networks

- Computer Science, Physics · ArXiv
- 2020

It is shown that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, a capability crucial for pretraining and transfer learning (as with BERT), and that any such infinite-width limit can be computed using the Tensor Programs technique.

On Lazy Training in Differentiable Programming

- Computer Science · NeurIPS
- 2019

This work shows that this "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels.

Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent

- Computer Science, Mathematics · NeurIPS
- 2019

This work shows that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.
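The linearization claim above can be checked directly on a toy model. The sketch below (a hypothetical one-hidden-layer network; all names and sizes are illustrative) compares the network against its first-order Taylor expansion around the initial parameters. As the parameter displacement shrinks, which is exactly what happens in wide networks under lazy training, the gap between the network and its linear model vanishes quadratically, so the dynamics reduce to those of a kernel method.

```python
import numpy as np

rng = np.random.default_rng(1)

# Tiny one-hidden-layer network; sizes and names are illustrative only.
d, h = 3, 4
x = rng.standard_normal(d)
W = rng.standard_normal((h, d))
a = rng.standard_normal(h)

def f(W, a):
    """Scalar network output for the fixed input x."""
    return a @ np.tanh(W @ x)

# Analytic gradients at the initial parameters theta_0 = (W, a).
pre = W @ x
g_a = np.tanh(pre)                            # df/da
g_W = np.outer(a * (1 - np.tanh(pre) ** 2), x)  # df/dW

def f_lin(dW, da):
    """First-order Taylor expansion of f around theta_0."""
    return f(W, a) + np.sum(g_W * dW) + g_a @ da

# The linearization error shrinks quadratically with the step size,
# mirroring why lazy training (tiny parameter motion) yields a kernel model.
errs = []
for eps in [1e-1, 1e-2, 1e-3]:
    dW = eps * rng.standard_normal((h, d))
    da = eps * rng.standard_normal(h)
    err = abs(f(W + dW, a + da) - f_lin(dW, da))
    errs.append(err)
    print(f"step size {eps:.0e}: |f - f_lin| = {err:.2e}")
```

The NTK is the Gram matrix of exactly these parameter gradients (`g_W`, `g_a`) across inputs, which is why the linearized dynamics are kernel gradient descent with the tangent kernel.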

To understand deep learning we need to understand kernel learning

- Computer Science, Mathematics · ICML
- 2018

It is argued that progress on understanding deep learning will be difficult until more tractable "shallow" kernel methods are better understood, and that new theoretical ideas are needed to understand the properties of classical kernel methods.

Label-Aware Neural Tangent Kernel: Toward Better Generalization and Local Elasticity

- Computer Science, Mathematics · NeurIPS
- 2020

A novel label-aware approach is proposed to reduce the performance gap of neural tangent kernels, and models trained with the proposed kernels are shown to better simulate neural networks in terms of generalization ability and local elasticity.

Every Model Learned by Gradient Descent Is Approximately a Kernel Machine

- Computer Science, Mathematics · ArXiv
- 2020

It is shown that deep networks learned by the standard gradient descent algorithm are in fact mathematically approximately equivalent to kernel machines, a learning method that simply memorizes the data and uses it directly for prediction via a similarity function (the kernel).