Corpus ID: 240354190

Neural Networks as Kernel Learners: The Silent Alignment Effect

Alexander Atanasov, Blake Bordelon, Cengiz Pehlevan
Neural networks in the lazy training regime converge to kernel machines. Can neural networks in the rich feature learning regime learn a kernel machine with a data-dependent kernel? We demonstrate that this can indeed happen due to a phenomenon we term silent alignment, which requires that the tangent kernel of a network evolves in eigenstructure while small and before the loss appreciably decreases, and grows only in overall scale afterwards. We show that such an effect takes place in… 
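The central quantity in the silent alignment story is how well the tangent kernel's eigenstructure matches the task. A common way to measure this is the kernel-target alignment A(K, yy^T) = y^T K y / (||K||_F ||y||^2). The sketch below (our own toy illustration, not the paper's code; all names are ours) computes the empirical NTK of a small two-layer linear network at a small initialization, trains it with gradient descent, and reports the alignment before and after:

```python
import numpy as np

def kernel_alignment(K, y):
    """Alignment A(K, yy^T) = y^T K y / (||K||_F ||y||^2); lies in [0, 1] for PSD K."""
    return float(y @ K @ y) / (np.linalg.norm(K, "fro") * float(y @ y))

# Toy one-hidden-layer linear network f(x) = w2 . (W1 x).
rng = np.random.default_rng(0)
n, d, h = 20, 5, 50
X = rng.standard_normal((n, d))
y = np.sign(X[:, 0])  # target depends only on the first input direction

scale = 0.01  # small initialization puts the network in the rich regime
W1 = scale * rng.standard_normal((h, d))
w2 = scale * rng.standard_normal(h)

def ntk_gram(W1, w2, X):
    """Empirical NTK Gram matrix K_ij = grad_theta f(x_i) . grad_theta f(x_j)."""
    gW1 = np.einsum("j,ik->ijk", w2, X)  # d f(x_i) / d W1, shape (n, h, d)
    gw2 = X @ W1.T                       # d f(x_i) / d w2, shape (n, h)
    G = np.concatenate([gW1.reshape(len(X), -1), gw2], axis=1)
    return G @ G.T

a_init = kernel_alignment(ntk_gram(W1, w2, X), y)

# Full-batch gradient descent on the mean squared loss.
lr = 0.1
for _ in range(2000):
    err = X @ W1.T @ w2 - y
    gw2 = (X @ W1.T).T @ err / n
    gW1 = np.outer(w2, err @ X) / n
    w2 = w2 - lr * gw2
    W1 = W1 - lr * gW1

a_final = kernel_alignment(ntk_gram(W1, w2, X), y)
print(f"alignment: {a_init:.3f} -> {a_final:.3f}")
```

With a small initialization the kernel's eigenstructure rotates toward the task early in training, so the final alignment is expected to exceed the initial one; the overall scale of the kernel also grows as the weights leave their small starting values.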


Depth induces scale-averaging in overparameterized linear Bayesian neural networks
Finite deep linear Bayesian neural networks are interpreted as data-dependent scale mixtures of Gaussian process predictors across output channels, connecting limiting results from previous studies within a unified framework.
Properties of the After Kernel
The "after kernel", defined like the neural tangent kernel but using the network's embedding after training, is studied for standard architectures on binary classification problems extracted from MNIST and CIFAR-10, trained with SGD in a standard way.


Kernel and Rich Regimes in Overparametrized Models
This work shows how the scale of the initialization controls the transition between the "kernel" and "rich" regimes and affects generalization in multilayer homogeneous models, and highlights an interesting role for the width of the model when the predictor is not identically zero at initialization.
Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks
This work investigates the generalization error of kernel regression and proposes a predictive theory of generalization, applicable to real data, that explains various phenomena observed in wide neural networks, which admit a kernel limit yet generalize well despite being overparameterized.
Rapid Feature Evolution Accelerates Learning in Neural Networks
It is shown that feature evolution is faster and more dramatic in deeper networks, and that networks with multiple output nodes develop separate, specialized kernels for each output channel, a phenomenon the authors term kernel specialization.
Deep learning versus kernel learning: an empirical study of loss landscape geometry and the time evolution of the Neural Tangent Kernel
A large-scale phenomenological analysis of training reveals a striking correlation between a diverse set of metrics over training time, governed by a rapid chaotic-to-stable transition in the first few epochs, which together poses challenges and opportunities for developing more accurate theories of deep learning.
Feature Learning in Infinite-Width Neural Networks
It is shown that the standard and NTK parametrizations of a neural network do not admit infinite-width limits that can learn features, which is crucial for pretraining and transfer learning such as with BERT, and that any such infinite-width limit can be computed using the Tensor Programs technique.
On Lazy Training in Differentiable Programming
This work shows that this "lazy training" phenomenon is not specific to over-parameterized neural networks but is due to a choice of scaling that makes the model behave as its linearization around the initialization, yielding a model equivalent to learning with positive-definite kernels.
Wide Neural Networks of Any Depth Evolve as Linear Models Under Gradient Descent
This work shows that for wide neural networks the learning dynamics simplify considerably and that, in the infinite width limit, they are governed by a linear model obtained from the first-order Taylor expansion of the network around its initial parameters.
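The "linear model" here is the first-order Taylor expansion of the network in its parameters, f_lin(θ) = f(θ₀) + ∇_θ f(θ₀)·(θ − θ₀). A minimal numerical sketch of this claim (our own toy, not the paper's code) compares a wide one-hidden-layer tanh network against its linearization under a small parameter perturbation:

```python
import numpy as np

rng = np.random.default_rng(1)
d, h = 3, 512  # wide hidden layer, where the linearization is accurate
x = rng.standard_normal(d)

W = rng.standard_normal((h, d)) / np.sqrt(d)
v = rng.standard_normal(h) / np.sqrt(h)

def f(W, v, x):
    """Scalar output of a one-hidden-layer tanh network."""
    return float(v @ np.tanh(W @ x))

# Parameter gradients at initialization.
pre = W @ x
grad_v = np.tanh(pre)                               # d f / d v
grad_W = np.outer(v * (1.0 - np.tanh(pre) ** 2), x) # d f / d W

# A small parameter perturbation, standing in for a few gradient steps.
eps = 1e-3
dW = eps * rng.standard_normal((h, d))
dv = eps * rng.standard_normal(h)

exact = f(W + dW, v + dv, x)
linear = f(W, v, x) + float(np.sum(grad_W * dW)) + float(grad_v @ dv)
print(abs(exact - linear))  # discrepancy is second order in eps
```

In the lazy regime the parameters move only a small distance from initialization, so this linearization stays accurate throughout training, and the resulting dynamics are those of kernel regression with the NTK.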
To understand deep learning we need to understand kernel learning
It is argued that progress on understanding deep learning will be difficult until more tractable "shallow" kernel methods are better understood, and that new theoretical ideas are needed to understand the properties of classical kernel methods.
Label-Aware Neural Tangent Kernel: Toward Better Generalization and Local Elasticity
A novel label-aware approach is proposed to reduce the performance gap of neural tangent kernels, and it is shown that models trained with the proposed kernels better simulate NNs in terms of generalization ability and local elasticity.
Every Model Learned by Gradient Descent Is Approximately a Kernel Machine
It is shown that deep networks learned by the standard gradient descent algorithm are in fact mathematically approximately equivalent to kernel machines, a learning method that simply memorizes the data and uses it directly for prediction via a similarity function (the kernel).