# Turing-Universal Learners with Optimal Scaling Laws

```bibtex
@article{Nakkiran2021TuringUniversalLW,
  title   = {Turing-Universal Learners with Optimal Scaling Laws},
  author  = {Preetum Nakkiran},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2111.05321}
}
```

For a given distribution, learning algorithm, and performance metric, the rate of convergence (or data-scaling law) is the asymptotic behavior of the algorithm's test performance as a function of the number of train samples. Many learning methods in both theory and practice have power-law rates, i.e. performance scales as n^{-α} for some α > 0. Moreover, both theoreticians and practitioners are concerned with improving the rates of their learning algorithms under settings of interest. We observe the…
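The power-law form above, performance ≈ C · n^{-α}, is linear in log-log space, so the exponent α can be estimated from empirical (n, test error) pairs by a least-squares fit. A minimal sketch (not from the paper; the function name and synthetic data are illustrative assumptions):

```python
import numpy as np

def fit_power_law(ns, errors):
    """Fit error ≈ C * n^(-alpha) by least squares in log-log space;
    return (C, alpha). Illustrative helper, not from the paper."""
    log_n = np.log(np.asarray(ns, dtype=float))
    log_e = np.log(np.asarray(errors, dtype=float))
    # log(error) = log(C) - alpha * log(n), so the slope is -alpha.
    slope, intercept = np.polyfit(log_n, log_e, 1)
    return np.exp(intercept), -slope

# Synthetic learning curve with C = 2 and alpha = 0.5:
ns = [100, 1_000, 10_000, 100_000]
errors = [2.0 * n ** -0.5 for n in ns]
C, alpha = fit_power_law(ns, errors)
print(round(alpha, 3))  # → 0.5
```

On real measurements the fit is only as good as the power-law assumption; curves with exponential or arbitrarily slow rates (see the references below) will not be well described by a single α.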

## References

Showing 1–10 of 28 references.

A theory of universal learning

- Computer Science, Mathematics · STOC
- 2021

Universal learning aims to understand the performance of learning algorithms on every data distribution, but without requiring uniformity over distributions; it is shown that only three universal rates are possible: exponential, linear, or arbitrarily slow.

Statistical Theory of Learning Curves under Entropic Loss Criterion

- Mathematics, Computer Science · Neural Computation
- 1993

A universal property of learning curves is elucidated, which shows how the generalization error, training error, and the complexity of the underlying stochastic machine are related, and how the behavior of a stochastic machine improves as the number of training examples increases.

Learning Curve Theory

- Computer Science, Mathematics · ArXiv
- 2021

This work develops and theoretically analyses the simplest possible (toy) model that can exhibit n^{-β} learning curves for arbitrary power β > 0, and determines whether power laws are universal or depend on the data distribution.

Complexity-based induction systems: Comparisons and convergence theorems

- Mathematics, Computer Science · IEEE Trans. Inf. Theory
- 1978

Levin has shown that if \tilde{P}'_M(x) is an unnormalized form of this measure, and P(x) is any computable probability measure on strings x, then \tilde{P}'_M(x) \geq C P(x), where C is a constant independent of x.

The Shape of Learning Curves: a Review

- Computer Science · ArXiv
- 2021

This review recounts the origins of the term, provides a formal definition of the learning curve, and provides a comprehensive overview of the literature regarding the shape of learning curves.

Asymptotic learning curves of kernel methods: empirical data v.s. Teacher-Student paradigm

- Mathematics, Physics · Journal of Statistical Mechanics: Theory and Experiment
- 2020

The results quantify how smooth Gaussian data should be to avoid the curse of dimensionality, and indicate that for kernel learning the relevant dimension of the data is defined in terms of how the distance between nearest data points depends on $n$.

Any Discrimination Rule Can Have an Arbitrarily Bad Probability of Error for Finite Sample Size

- Mathematics, Medicine · IEEE Transactions on Pattern Analysis and Machine Intelligence
- 1982

Any attempt to find a nontrivial distribution-free upper bound for R_n will fail, and any results on the rate of convergence of R_n to R^* must use assumptions about the distribution of (X, Y).

Learning from examples in large neural networks.

- Computer Science, Medicine · Physical Review Letters
- 1990

Numerical results on training in layered neural networks indicate that the generalization error improves gradually in some cases, and sharply in others, and statistical mechanics is used to study generalization curves in large layered networks.

Understanding Machine Learning - From Theory to Algorithms

- Computer Science
- 2014

The aim of this textbook is to introduce machine learning, and the algorithmic paradigms it offers, in a principled way in an advanced undergraduate or beginning graduate course.

Deep Learning Scaling is Predictable, Empirically

- Computer Science, Mathematics · ArXiv
- 2017

A large scale empirical characterization of generalization error and model size growth as training sets grow is presented and it is shown that model size scales sublinearly with data size.