Corpus ID: 235262800

Principal Components Bias in Deep Neural Networks

@inproceedings{Hacohen2021PrincipalCB,
  title={Principal Components Bias in Deep Neural Networks},
  author={Guy Hacohen and Daphna Weinshall},
  year={2021}
}
Recent work suggests that convolutional neural networks of different architectures learn to classify images in the same order. To understand this phenomenon, we revisit the over-parametrized deep linear network model. Our asymptotic analysis, assuming that the hidden layers are wide enough, reveals that the convergence rate of this model's parameters is exponentially faster along directions corresponding to the larger principal components of the data, at a rate governed by the singular values…
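The abstract's central claim lends itself to a small numerical illustration: in an over-parameterized two-layer linear network trained by plain gradient descent, the end-to-end map converges to the target noticeably faster along the data's leading principal directions. The sketch below only illustrates that effect under assumed dimensions, initialization, and learning rate; it is not the authors' code, and all names in it are made up for the example.

```python
# Minimal sketch (assumptions, not the paper's code): gradient descent on an
# over-parameterized two-layer linear network fits the target faster along
# principal directions of the data with larger singular values.
import numpy as np

rng = np.random.default_rng(0)
d, n, hidden = 10, 2000, 256                        # illustrative sizes

# Data whose principal components have sharply decaying singular values.
sing_vals = np.geomspace(3.0, 0.1, d)
U = np.linalg.qr(rng.normal(size=(d, d)))[0]        # principal directions
X = U @ np.diag(sing_vals) @ rng.normal(size=(d, n))

A_true = rng.normal(size=(1, d))                    # linear teacher
y = A_true @ X

W1 = rng.normal(size=(hidden, d)) / np.sqrt(d)      # over-parameterized student
W2 = rng.normal(size=(1, hidden)) / np.sqrt(hidden)
lr = 1e-3

for step in range(2001):
    E = W2 @ W1 @ X - y                             # residuals, shape (1, n)
    gW2 = E @ (W1 @ X).T / n
    gW1 = W2.T @ E @ X.T / n
    W2 -= lr * gW2
    W1 -= lr * gW1
    if step % 500 == 0:
        # Error of the end-to-end map along each principal direction of X.
        err_per_pc = np.abs((W2 @ W1 - A_true) @ U).ravel()
        print(step, np.round(err_per_pc, 3))
```

Under these settings one should see the error along the leading principal directions shrink much earlier than along the trailing ones, mirroring the qualitative behavior the abstract describes.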
1 Citation
The Grammar-Learning Trajectories of Neural Language Models
The learning trajectories of linguistic phenomena provide insight into the nature of linguistic representation, beyond what can be gleaned from inspecting the behavior of an adult speaker. To apply a…

References

SHOWING 1-10 OF 56 REFERENCES
Towards Understanding the Generalization Bias of Two Layer Convolutional Linear Classifiers with Gradient Descent
TLDR: A general analysis of the generalization performance as a function of data distribution and convolutional filter size is provided, given gradient descent as the optimization algorithm, and the results are interpreted using concrete examples.
On the Spectral Bias of Neural Networks
TLDR: This work shows that deep ReLU networks are biased towards low frequency functions, and studies the robustness of the frequency components with respect to parameter perturbation, developing the intuition that the parameters must be finely tuned to express high frequency functions. (A toy sketch of this low-frequency bias appears after this reference list.)
The Implicit Bias of Depth: How Incremental Learning Drives Generalization
TLDR: The notion of incremental learning dynamics is defined and the conditions on depth and initialization for which this phenomenon arises in deep linear models are derived, proving that while shallow models can exhibit incremental learning dynamics, they require the initialization to be exponentially small for these dynamics to present themselves.
Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers
TLDR: It is proved that overparameterized neural networks, trained with SGD (stochastic gradient descent) or its variants, can learn some notable concept classes in polynomial time using polynomially many samples, including functions representable by two- and three-layer networks with fewer parameters and smooth activations.
SGD on Neural Networks Learns Functions of Increasing Complexity
TLDR: Key to the work is a new measure of how well one classifier explains the performance of another, based on conditional mutual information, which can be helpful in explaining why SGD-learned classifiers tend to generalize well even in the over-parameterized regime.
Understanding deep learning requires rethinking generalization
TLDR: These experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data, and confirm that simple depth-two neural networks already have perfect finite sample expressivity.
Analysis of feature learning in weight-tied autoencoders via the mean field lens
TLDR: A new argument proves that the required number of neurons N for autoencoder models is only polynomial in the data dimension d, and it is conjectured that N is necessarily larger than a data-dependent intrinsic dimension, a behavior fundamentally different from previously studied setups.
The Surprising Simplicity of the Early-Time Learning Dynamics of Neural Networks
TLDR: It is formally proved that, for a class of well-behaved input distributions, the early-time learning dynamics of a two-layer fully-connected neural network can be mimicked by training a simple linear model on the inputs.
Implicit Regularization of Discrete Gradient Dynamics in Deep Linear Neural Networks
TLDR: This work studies the discrete gradient dynamics of training a two-layer linear network with the least-squares loss, using a time rescaling to show that these dynamics sequentially learn the solutions of a reduced-rank regression with gradually increasing rank.
The Convergence Rate of Neural Networks for Learned Functions of Different Frequencies
TLDR: It is shown theoretically and experimentally that a shallow neural network without bias cannot represent or learn simple, low frequency functions with odd frequencies, and the analysis leads to specific predictions of the time it will take a network to learn functions of varying frequency.
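As a rough illustration of the spectral-bias effect summarized in the "On the Spectral Bias of Neural Networks" entry above, the sketch below (an assumption-laden toy, not that paper's experiment; width, frequencies, and learning rate are arbitrary choices) trains a small ReLU network on the sum of a low-frequency and a high-frequency sine and tracks how much of each component the network has picked up.

```python
# Minimal sketch (not the referenced paper's code): a one-hidden-layer ReLU
# network trained by gradient descent on sin(2*pi*x) + sin(2*pi*16*x)
# tends to capture the low-frequency component first.
import numpy as np

rng = np.random.default_rng(0)
n, width, lr = 256, 512, 1e-3                         # illustrative choices

x = np.linspace(0.0, 1.0, n)[:, None]                 # inputs in [0, 1]
freqs = (1, 16)                                       # low and high frequency
y = sum(np.sin(2 * np.pi * k * x) for k in freqs)     # target signal

W1 = rng.normal(size=(1, width))                      # input-to-hidden weights
b1 = rng.normal(size=width)                           # spread the ReLU kinks
W2 = rng.normal(size=(width, 1)) / np.sqrt(width)     # hidden-to-output weights

for step in range(20001):
    h = np.maximum(x @ W1 + b1, 0.0)                  # ReLU hidden activations
    pred = h @ W2
    err = pred - y                                    # (n, 1) residuals
    # Gradient descent on the mean-squared error.
    gW2 = h.T @ err / n
    dh = (err @ W2.T) * (h > 0)
    gW1 = x.T @ dh / n
    gb1 = dh.mean(axis=0)
    W2 -= lr * gW2; W1 -= lr * gW1; b1 -= lr * gb1
    if step % 4000 == 0:
        # Approximate sine coefficient of the prediction at each frequency
        # (the target coefficient is 1.0 for both frequencies).
        coef = [2 * np.mean(pred.ravel() * np.sin(2 * np.pi * k * x.ravel()))
                for k in freqs]
        print(step, np.round(coef, 2))
```

Under these settings the coefficient at frequency 1 should approach its target value within the first few thousand steps, while the coefficient at frequency 16 lags far behind, which is the low-frequency bias the TLDR describes.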