Corpus ID: 243860726

Turing-Universal Learners with Optimal Scaling Laws

@article{Nakkiran2021TuringUniversalLW,
  title={Turing-Universal Learners with Optimal Scaling Laws},
  author={Preetum Nakkiran},
  journal={ArXiv},
  year={2021},
  volume={abs/2111.05321}
}
For a given distribution, learning algorithm, and performance metric, the rate of convergence (or data-scaling law) is the asymptotic behavior of the algorithm’s test performance as a function of the number of train samples. Many learning methods in both theory and practice have power-law rates, i.e. performance scales as n^{-α} for some α > 0. Moreover, both theoreticians and practitioners are concerned with improving the rates of their learning algorithms under settings of interest. We observe the…
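
As an illustrative sketch (not from the paper): given measured (n, test error) pairs, a power-law exponent α as above can be estimated by a straight-line fit in log-log space. The data below is synthetic and the numbers are placeholders.

```python
import numpy as np

# Synthetic (n, test_error) measurements; in practice these would come from
# training the learner at several sample sizes n and evaluating on held-out data.
n = np.array([1_000, 2_000, 4_000, 8_000, 16_000, 32_000])
test_error = 5.0 * n ** -0.5  # pretend the true law is err(n) = 5 * n^(-1/2)

# A power law err(n) ≈ c * n^(-alpha) is linear in log-log space:
# log err = log c - alpha * log n, so fit a line and read off the slope.
slope, log_c = np.polyfit(np.log(n), np.log(test_error), deg=1)
print(f"estimated alpha ≈ {-slope:.3f}, c ≈ {np.exp(log_c):.3f}")
```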

References

Showing 1-10 of 28 references

A theory of universal learning

Universal learning aims to understand the performance of learning algorithms on every data distribution, but without requiring uniformity over the distribution; this work shows that there are only three possible rates of universal learning: exponential, linear, or arbitrarily slow.
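
For orientation, a rough LaTeX restatement of that trichotomy as we read it (our paraphrase, not verbatim from the paper; C_P and c_P denote distribution-dependent constants, and the precise definitions are in the paper):

```latex
% Our paraphrase of the universal-rates trichotomy (not verbatim from the paper).
A class is learnable at rate $R(n)$ if some learner $\hat h_n$ satisfies,
for every realizable distribution $P$,
\[
  \mathbb{E}\!\left[\operatorname{er}_P(\hat h_n)\right] \;\le\; C_P\, R(c_P\, n),
\]
and the optimal universal rate of any class is one of
\[
  R(n) = e^{-n}, \qquad R(n) = \tfrac{1}{n}, \qquad \text{or arbitrarily slow.}
\]
```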

Learning Curve Theory

This work develops and theoretically analyses the simplest possible (toy) model that can exhibit n^{-β} learning curves for arbitrary power β > 0, and determines whether power laws are universal or depend on the data distribution.
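
To illustrate the phenomenon (this is our own toy simulation, not necessarily the model analysed in the paper): a learner that simply memorizes Zipf-distributed discrete items and errs on every unseen item already exhibits an approximately power-law learning curve, with an exponent governed by the tail of the data distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Zipf-like distribution over K discrete items: P(i) proportional to i^(-a).
K, a = 100_000, 2.0
p = np.arange(1, K + 1, dtype=float) ** (-a)
p /= p.sum()

def memorizer_error(n: int, trials: int = 20) -> float:
    """Expected error of a learner that memorizes the items seen in a sample
    of size n and is wrong on every unseen item (the 'missing mass')."""
    errs = []
    for _ in range(trials):
        seen = np.unique(rng.choice(K, size=n, p=p))
        errs.append(1.0 - p[seen].sum())
    return float(np.mean(errs))

ns = [100, 400, 1_600, 6_400, 25_600]
errors = [memorizer_error(n) for n in ns]

# Fit the learning-curve exponent beta in err(n) ≈ c * n^(-beta) on a log-log scale.
slope, _ = np.polyfit(np.log(ns), np.log(errors), deg=1)
print(f"empirical learning-curve exponent beta ≈ {-slope:.2f}")
```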

Complexity-based induction systems: Comparisons and convergence theorems

Levin has shown that if P̃'_M(x) is an unnormalized form of this measure, and P(x) is any computable probability measure on strings x, then P̃'_M(x) ≥ C · P(x), where C is a constant independent of x.

The Shape of Learning Curves: a Review

This review recounts the origins of the term, gives a formal definition of the learning curve, and provides a comprehensive overview of the literature on the shape of learning curves.

Asymptotic learning curves of kernel methods: empirical data versus teacher–student paradigm

This work measures the learning-curve exponent β when applying kernel methods to real datasets, and argues that the rather large exponents observed are possible due to the small effective dimension of the data.

Any Discrimination Rule Can Have an Arbitrarily Bad Probability of Error for Finite Sample Size

L. Devroye, IEEE Transactions on Pattern Analysis and Machine Intelligence, 1982
Any attempt to find a nontrivial distribution-free upper bound for R_n will fail, and any results on the rate of convergence of R_n to R* must use assumptions about the distribution of (X, Y).

Learning from examples in large neural networks.

Statistical mechanics is used to study generalization curves in large layered networks; numerical results on training indicate that the generalization error improves gradually in some cases and sharply in others.

Understanding Machine Learning - From Theory to Algorithms

The aim of this textbook is to introduce machine learning, and the algorithmic paradigms it offers, in a principled way in an advanced undergraduate or beginning graduate course.

Deep Learning Scaling is Predictable, Empirically

A large scale empirical characterization of generalization error and model size growth as training sets grow is presented and it is shown that model size scales sublinearly with data size.

A Theory of Universal Artificial Intelligence based on Algorithmic Complexity

This work constructs a modified algorithm, AIXItl, which is still effectively more intelligent than any other time t and space l bounded agent, and gives strong arguments that the resulting AI model is the most intelligent unbiased agent possible.