Corpus ID: 235166077

# Compressing Heavy-Tailed Weight Matrices for Non-Vacuous Generalization Bounds

@article{Shin2021CompressingHW,
title={Compressing Heavy-Tailed Weight Matrices for Non-Vacuous Generalization Bounds},
author={John Y. Shin},
journal={ArXiv},
year={2021},
volume={abs/2105.11025}
}
Heavy-tailed distributions have been studied in statistics, random matrix theory, physics, and econometrics as models of correlated systems, among other domains. Moreover, heavy-tailed distributions of the eigenvalues of the covariance matrices of neural-network weight matrices have been shown empirically to correlate with test-set accuracy in several works (e.g., [1]), but a formal relationship between heavy-tailed parameters and generalization bounds had yet to be demonstrated. In this…
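As a rough illustration of the empirical observable described above, the tail behaviour of the eigenvalues of a weight matrix's covariance, one can fit a power-law exponent to the top of the spectrum. This is only a sketch: the Hill estimator and the cutoff `k` are illustrative choices, not the paper's method.

```python
import numpy as np

def hill_alpha(x, k=50):
    """Hill estimator of the power-law tail index from the k largest values."""
    x = np.sort(x)
    return k / np.sum(np.log(x[-k:] / x[-k - 1]))

def spectral_tail_alpha(W, k=50):
    """Tail index of the eigenvalues of the sample covariance W^T W / n."""
    n = W.shape[0]
    eigs = np.linalg.eigvalsh(W.T @ W / n)
    return hill_alpha(eigs, k)

rng = np.random.default_rng(0)
# Heavy-tailed (Student-t, df=3) vs. Gaussian weight matrices
a_heavy = spectral_tail_alpha(rng.standard_t(df=3, size=(1000, 500)))
a_gauss = spectral_tail_alpha(rng.standard_normal((1000, 500)))
print(a_heavy, a_gauss)  # the heavy-tailed matrix should give a smaller exponent
```

A smaller fitted exponent indicates a heavier eigenvalue tail; the Gaussian matrix, whose spectrum has a sharp Marchenko-Pastur edge rather than a power-law tail, yields a much larger estimate.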
## 2 Citations

• NeurIPS 2021 (Computer Science): This study links compressibility to two recently established properties of SGD, and proves that the networks are guaranteed to be "$\ell_p$-compressible": the compression errors of different pruning techniques become arbitrarily small as the network size increases.
• ArXiv 2022 (Computer Science): The infinite-width limit of deep feedforward neural networks whose weights are dependent and modelled via a mixture of Gaussian distributions is studied; it is shown that, in this regime, the weights are compressible and feature learning is possible.
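The compressibility claim in the first citation can be illustrated with a toy magnitude-pruning experiment (a sketch under assumed weight distributions, not the cited paper's construction): heavy-tailed weights concentrate their norm in a few large entries, so zeroing the rest incurs little error.

```python
import numpy as np

def prune_error(w, keep_frac=0.05):
    """Relative L2 error after zeroing all but the largest-magnitude entries."""
    k = int(len(w) * keep_frac)
    pruned = np.argsort(np.abs(w))[:-k]   # indices of the entries set to zero
    return np.linalg.norm(w[pruned]) / np.linalg.norm(w)

rng = np.random.default_rng(0)
n = 100_000
err_heavy = prune_error(rng.standard_t(df=1.5, size=n))  # heavy-tailed weights
err_gauss = prune_error(rng.standard_normal(n))          # Gaussian weights
print(err_heavy, err_gauss)
```

Keeping only 5% of the heavy-tailed weights should preserve most of the norm, while the same budget on Gaussian weights loses a much larger fraction.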

## References

Showing 1–10 of 36 references.

• The classical random matrix theory studies the asymptotic spectral properties of random matrices as their dimensions grow to infinity. In contrast, the non-asymptotic branch of the theory is focused on…
• ICML 2019 (Computer Science): A novel form of heavy-tailed self-regularization is identified, similar to the self-organization seen in the statistical physics of disordered systems, which can depend strongly on the many knobs of the training process.
• SDM 2020 (Computer Science): A new theory of Heavy-Tailed Self-Regularization (HT-SR) is used to develop a universal capacity-control metric, a weighted average of power-law (PL) exponents, which correlates very well with the reported test accuracies of the DNNs studied.
• ICML 2021 (Computer Science): It is claimed that, depending on the structure of the Hessian of the loss at the minimum and on the algorithm parameters $\eta$ and $b$, the SGD iterates converge to a stationary distribution; these results are the first of their kind to rigorously characterize the empirically observed heavy-tailed behavior of SGD.
• ICML 2021 (Computer Science): Modelling stochastic optimization algorithms as discrete random recurrence relations, it is shown that multiplicative noise, as commonly arises from variance in local rates of convergence, results in heavy-tailed stationary behaviour of the parameters.
• 2007 (Mathematics): We study the statistics of the largest eigenvalue $\lambda_{\max}$ of $N \times N$ random matrices with i.i.d. entries of variance $1/N$ but with power-law tails $P(M_{ij}) \sim |M_{ij}|^{-1-\mu}$. When $\mu > 4$, $\lambda_{\max}$ converges to 2…
• ArXiv 2019 (Computer Science): It is argued that the Gaussianity assumption might fail to hold in deep-learning settings, rendering Brownian-motion-based analyses inappropriate, and an explicit connection is established between the convergence rate of SGD to a local minimum and the tail index $\alpha$.
• ArXiv 2020 (Mathematics): The main theorem not only recovers some existing results, such as concentration of sums of sub-Weibull random variables, but also produces new results for sums of random variables with heavier tails, based on standard truncation arguments.
• ICML 2018 (Computer Science): These results provide some theoretical justification for the widespread empirical success in compressing deep nets, and yield generalization bounds that are orders of magnitude better in practice.
• Physical Review Letters 2016 (Mathematics): This work establishes the equation determining the localization transition and obtains the phase diagram, showing that the eigenvalue statistics match those of the Gaussian orthogonal ensemble throughout the delocalized phase and are Poisson-like in the localized phase.
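The 2007 result quoted in the references, that $\lambda_{\max}$ converges to 2 when $\mu > 4$, can be checked numerically. A minimal sketch, assuming Student-t entries with df $= \mu$ as a stand-in for the power-law distribution, symmetrized and scaled so each off-diagonal entry has variance $1/N$:

```python
import numpy as np

rng = np.random.default_rng(1)
N, mu = 1500, 6.0                   # mu > 4: lambda_max should approach 2
A = rng.standard_t(df=mu, size=(N, N))
A /= np.sqrt(mu / (mu - 2))         # rescale Student-t to unit variance
M = (A + A.T) / np.sqrt(2 * N)      # symmetric, off-diagonal entry variance 1/N
lam_max = np.linalg.eigvalsh(M)[-1]
print(lam_max)                      # should be close to the semicircle edge at 2
```

For $\mu < 4$ the same experiment would instead show outlier eigenvalues driven by the largest matrix entries, which is the regime contrasted in the cited paper.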