• Corpus ID: 240354190

Neural Networks as Kernel Learners: The Silent Alignment Effect

Alexander Atanasov, Blake Bordelon, Cengiz Pehlevan
Neural networks in the lazy training regime converge to kernel machines. Can neural networks in the rich feature learning regime learn a kernel machine with a data-dependent kernel? We demonstrate that this can indeed happen due to a phenomenon we term silent alignment, which requires that the tangent kernel of a network evolves in eigenstructure while small and before the loss appreciably decreases, and grows only in overall scale afterwards. We show that such an effect takes place in… 
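
The abstract's distinction between a kernel changing in eigenstructure and a kernel changing only in overall scale can be illustrated with a kernel-target alignment score. The block below is a minimal NumPy sketch and is not taken from the paper: the function name `kernel_target_alignment`, the cosine-similarity definition of alignment, and the synthetic kernels `K_init` and `K_aligned` are all assumptions chosen for illustration.

```python
import numpy as np


def kernel_target_alignment(K, y):
    """Cosine similarity between a kernel matrix K and the outer product y y^T.

    Assumed definition (not from the paper): <K, y y^T>_F / (||K||_F * ||y y^T||_F),
    using the identity ||y y^T||_F = y . y.
    """
    K = np.asarray(K, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(y @ K @ y / (np.linalg.norm(K) * (y @ y)))


rng = np.random.default_rng(0)
n = 50
y = rng.normal(size=n)

# Hypothetical "initial" kernel: a random PSD Gram matrix unrelated to y.
Phi = rng.normal(size=(n, 10))
K_init = Phi @ Phi.T

# Hypothetical "aligned" kernel: tiny in scale, but dominated by the y direction.
K_aligned = 1e-3 * (np.outer(y, y) + 0.01 * K_init)

a_init = kernel_target_alignment(K_init, y)
a_aligned = kernel_target_alignment(K_aligned, y)

# Growing the aligned kernel in overall scale leaves the alignment unchanged.
a_scaled = kernel_target_alignment(1e4 * K_aligned, y)

print(f"initial: {a_init:.3f}  aligned: {a_aligned:.3f}  scaled: {a_scaled:.3f}")
```

Because the score is normalized by the Frobenius norm of the kernel, it is sensitive to eigenstructure but blind to overall scale, which is why a kernel can align while still small and then grow without changing the score.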

Self-Consistent Dynamical Field Theory of Kernel Evolution in Wide Neural Networks
Comparisons of the self-consistent solution to various approximation schemes including the static NTK approximation, gradient independence assumption, and leading order perturbation theory are provided, showing that each of these approximations can break down in regimes where general self-consistent solutions still provide an accurate description.
A Theory of Neural Tangent Kernel Alignment and Its Influence on Training
This work seeks to theoretically understand kernel alignment, a prominent and ubiquitous structural change that aligns the NTK with the target function, and identifies factors in network architecture and data structure that drive kernel alignment.
Depth induces scale-averaging in overparameterized linear Bayesian neural networks
Finite deep linear Bayesian neural networks are interpreted as data-dependent scale mixtures of Gaussian process predictors across output channels to study representation learning in these networks, allowing us to connect limiting results obtained in previous studies within a unified framework.
Properties of the After Kernel
This work studies the “after kernel”, which is defined using the same gradient-based embedding as the neural tangent kernel, except after training, for neural networks with standard architectures on binary classification problems extracted from MNIST and CIFAR-10, trained using SGD in a standard way.


On Lazy Training in Differentiable Programming
This work shows that this "lazy training" phenomenon is not specific to over-parameterized neural networks, and is due to a choice of scaling that makes the model behave as its linearization around the initialization, thus yielding a model equivalent to learning with positive-definite kernels.
Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
  • International Conference on Learning Representations
  • 2014
Capturing the learning curves of generic features maps for realistic data sets with a teacher-student model
A rigorous formula is proved for the asymptotic training loss and generalisation error achieved by empirical risk minimization for the high-dimensional Gaussian covariate model used in teacher-student models.
Landscape and training regimes in deep learning
Algorithmic Regularization in Learning Deep Homogeneous Models: Layers are Automatically Balanced
It is rigorously proved that gradient flow effectively enforces the differences between squared norms across different layers to remain invariant without any explicit regularization, which implies that if the weights are initially small, gradient flow automatically balances the magnitudes of all layers.
Neural tangent kernel: convergence and generalization in neural networks (invited paper)
This talk introduces the Neural Tangent Kernel formalism, gives a number of results on it, and explains how they give insight into the dynamics of neural networks during training and into their generalization features.
On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization
This paper suggests that, sometimes, increasing depth can speed up optimization and proves that it is mathematically impossible to obtain the acceleration effect of overparametrization via gradients of any regularizer.
Effect of Batch Learning in Multilayer Neural Networks
An experimental study on multilayer perceptrons and linear neural networks (LNN) shows that batch learning induces strong overtraining on both models in overrealizable cases, which means the degradation of generalization error caused by surplus units can be alleviated.
Spectral bias and task-model alignment explain generalization in kernel regression and infinitely wide neural networks
This work investigates generalization error for kernel regression, and proposes a predictive theory of generalization in kernel regression applicable to real data, which explains various generalization phenomena observed in wide neural networks, which admit a kernel limit and generalize well despite being overparameterized.
Spectrum Dependent Learning Curves in Kernel Regression and Wide Neural Networks
A new spectral principle is identified: as the size of the training set grows, kernel machines and neural networks fit successively higher spectral modes of the target function.