# Revisiting minimum description length complexity in overparameterized models

@inproceedings{Dwivedi2020RevisitingMD, title={Revisiting minimum description length complexity in overparameterized models}, author={Raaz Dwivedi and Chandan Singh and Bin Yu and Martin J. Wainwright}, year={2020} }

Complexity is a fundamental concept underlying statistical learning theory that aims to inform generalization performance. Parameter count, while successful in low-dimensional settings, is not well-justified for overparameterized settings, where the number of parameters exceeds the number of training samples. We revisit complexity measures based on Rissanen’s principle of minimum description length (MDL) and define a novel MDL-based complexity (MDL-COMP) that remains valid for…
