Corpus ID: 235166077

Compressing Heavy-Tailed Weight Matrices for Non-Vacuous Generalization Bounds

@article{Shin2021CompressingHW,
  title={Compressing Heavy-Tailed Weight Matrices for Non-Vacuous Generalization Bounds},
  author={John Y. Shin},
  journal={ArXiv},
  year={2021},
  volume={abs/2105.11025}
}
Heavy-tailed distributions have been studied in statistics, random matrix theory, physics, and econometrics, among other domains, as models of correlated systems. Further, heavy-tail-distributed eigenvalues of the covariance matrices of the weight matrices in neural networks have been shown empirically to correlate with test-set accuracy in several works (e.g. [1]), but a formal relationship between heavy-tail-distributed parameters and generalization bounds had yet to be demonstrated. In this…
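For context on the kind of diagnostic this line of work relies on, here is a minimal sketch, assuming a single stand-in weight matrix and a Hill-type tail estimator (both illustrative assumptions, not the paper's own procedure): compute the eigenvalues of the empirical covariance of a layer's weight matrix and estimate a power-law tail exponent.

```python
# Minimal sketch: eigenvalue spectrum of a weight matrix's empirical covariance
# and a crude power-law tail estimate. Illustrative only; works such as [1] use
# maximum-likelihood power-law fits over the layers of trained networks.
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained layer's weight matrix (assumption: heavy-tailed entries).
W = rng.standard_t(df=3, size=(512, 256))

# Empirical spectral density of X = W^T W / n.
X = W.T @ W / W.shape[0]
eigs = np.sort(np.linalg.eigvalsh(X))[::-1]          # descending eigenvalues

# Hill estimator over the top k eigenvalues (k = 25 is an arbitrary choice here);
# the corresponding power-law density exponent is the Hill index plus one.
k = 25
hill_index = k / np.sum(np.log(eigs[:k] / eigs[k]))
print(f"largest eigenvalue: {eigs[0]:.2f}, estimated PL density exponent: {hill_index + 1:.2f}")
```

In practice such exponents are fitted layer by layer on a trained network and aggregated; the random matrix above merely stands in for a trained layer.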


Heavy Tails in SGD and Compressibility of Overparametrized Neural Networks

This study links compressibility to two recently established properties of SGD and proves that the resulting networks are guaranteed to be $\ell_p$-compressible, with the compression errors of different pruning techniques becoming arbitrarily small as the network size increases.
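As a hedged illustration of that compressibility claim (the Pareto weight distribution, the 10% pruning ratio, and the sizes below are assumptions for this sketch, not the paper's construction), magnitude pruning of heavy-tailed weights leaves a relative error that shrinks as the dimension grows:

```python
# Sketch: heavy-tailed weight vectors are compressible under magnitude pruning.
# Assumptions for illustration: Pareto(1.5)-distributed weights, keep top 10%.
import numpy as np

rng = np.random.default_rng(0)

for n in (1_000, 10_000, 100_000):
    w = rng.pareto(a=1.5, size=n) * rng.choice([-1.0, 1.0], size=n)
    k = n // 10                                  # keep the top 10% by magnitude
    pruned = np.zeros_like(w)
    top = np.argsort(np.abs(w))[-k:]
    pruned[top] = w[top]
    rel_err = np.linalg.norm(w - pruned) / np.linalg.norm(w)
    print(f"n = {n:>7}: relative l2 pruning error = {rel_err:.3f}")
```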

Deep neural networks with dependent weights: Gaussian Process mixture limit, heavy tails, sparsity and compressibility

The infinite-width limit of deep feedforward neural networks whose weights are dependent and modelled via a mixture of Gaussian distributions is studied, and it is shown that, in this regime, the weights are compressible and feature learning is possible.

References

Showing 1–10 of 36 references

Spectral Properties of Heavy-Tailed Random Matrices

The classical Random Matrix Theory studies asymptotic spectral properties of random matrices when their dimensions grow to infinity. In contrast, the non-asymptotic branch of the theory is focused on…

Traditional and Heavy-Tailed Self Regularization in Neural Network Models

A novel form of Heavy-Tailed Self-Regularization is identified that is similar to the self-organization seen in the statistical physics of disordered systems and that can depend strongly on the many knobs of the training process.

Heavy-Tailed Universality Predicts Trends in Test Accuracies for Very Large Pre-Trained Deep Neural Networks

A new Theory of Heavy-Tailed Self-Regularization (HT-SR) is used to develop a universal capacity-control metric that is a weighted average of power-law (PL) exponents, and this metric correlates very well with the reported test accuracies of these DNNs.

The Heavy-Tail Phenomenon in SGD

It is claimed that, depending on the structure of the Hessian of the loss at the minimum and the choices of the algorithm parameters $\eta$ (step size) and $b$ (batch size), the SGD iterates will converge to a heavy-tailed stationary distribution, and these results are the first of their kind to rigorously characterize the empirically observed heavy-tailed behavior of SGD.

Multiplicative noise and heavy tails in stochastic optimization

Modelling stochastic optimization algorithms as discrete random recurrence relations, it is shown that multiplicative noise, as it commonly arises due to variance in local rates of convergence, results in heavy-tailed stationary behaviour in the parameters.
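As a hedged illustration of the recurrence-relation viewpoint in the two SGD papers above, the sketch below simulates a scalar Kesten-type recursion $x_{k+1} = a_k x_k + b_k$ and estimates the tail index of its stationary distribution; the Gaussian noise parameters and the Hill estimator are assumptions for this toy, not the papers' models of SGD.

```python
# Sketch: multiplicative noise in a linear recursion x <- a*x + b produces a
# heavy-tailed stationary distribution even though a and b are Gaussian.
# Parameters are illustrative assumptions (E[log|a|] < 0, P(|a| > 1) > 0).
import numpy as np

rng = np.random.default_rng(0)
n_steps, n_chains = 2_000, 20_000

x = np.zeros(n_chains)
for _ in range(n_steps):
    a = rng.normal(loc=0.7, scale=0.6, size=n_chains)   # multiplicative noise
    b = rng.normal(loc=0.0, scale=1.0, size=n_chains)   # additive noise
    x = a * x + b

# Hill estimate of the tail index from the largest 1% of |x| across chains.
samples = np.sort(np.abs(x))[::-1]
k = n_chains // 100
tail_index = k / np.sum(np.log(samples[:k] / samples[k]))
print(f"estimated tail index of the stationary |x|: {tail_index:.2f}")
```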

On the top eigenvalue of heavy-tailed random matrices

We study the statistics of the largest eigenvalue $\lambda_{\max}$ of $N \times N$ random matrices with IID entries of variance $1/N$ but with power-law tails $P(M_{ij}) \sim |M_{ij}|^{-1-\mu}$. When $\mu > 4$, $\lambda_{\max}$ converges to 2 with…
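A hedged numerical toy for the regime described above (the matrix size, the symmetric-Pareto entries, and the empirical standardization are assumptions for illustration, not the paper's setup): with tail exponent $\mu > 4$ the largest eigenvalue sits near the semicircle edge 2, while for $\mu < 4$ it is dominated by the largest entries.

```python
# Sketch: largest eigenvalue of a symmetric random matrix with power-law entries,
# P(|M_ij| > t) ~ t^{-mu}, rescaled so entries have variance roughly 1/N.
import numpy as np

rng = np.random.default_rng(0)
N = 1_000

def lambda_max(mu: float) -> float:
    raw = (1.0 + rng.pareto(a=mu, size=(N, N))) * rng.choice([-1.0, 1.0], size=(N, N))
    raw = raw / raw.std() / np.sqrt(N)            # empirical standardization (assumption)
    M = np.triu(raw) + np.triu(raw, 1).T          # symmetrize
    return float(np.linalg.eigvalsh(M)[-1])

for mu in (5.0, 3.0):
    print(f"mu = {mu}: lambda_max ≈ {lambda_max(mu):.2f}")
```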

On the Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks

It is argued that the Gaussianity assumption might fail to hold in deep learning settings, rendering Brownian motion-based analyses inappropriate, and an explicit connection is established between the convergence rate of SGD to a local minimum and the tail-index $\alpha$.

Sharp Concentration Results for Heavy-Tailed Distributions

The main theorem not only recovers some existing results, such as concentration of the sum of sub-Weibull random variables, but also yields new results, based on standard truncation arguments, for sums of random variables with heavier tails.
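For context, the standard sub-Weibull definition used in this literature (stated here from general knowledge; the paper's exact assumptions may differ): a random variable $X$ is sub-Weibull of order $\theta > 0$ when its Orlicz-type norm

$$\|X\|_{\psi_\theta} \;=\; \inf\Big\{ t > 0 : \mathbb{E}\,\exp\big( (|X|/t)^{\theta} \big) \le 2 \Big\}$$

is finite; smaller $\theta$ corresponds to heavier (stretched-exponential) tails, and truncation arguments extend concentration results beyond this class.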

Stronger generalization bounds for deep nets via a compression approach

These results provide some theoretical justification for the widespread empirical success in compressing deep nets and show generalization bounds that are orders of magnitude better in practice.

Level Statistics and Localization Transitions of Lévy Matrices.

This work establishes the equation determining the localization transition and obtains the phase diagram, showing that the eigenvalue statistics are the same as those of the Gaussian orthogonal ensemble throughout the delocalized phase and are Poisson-like in the localized phase.