• Corpus ID: 243860726

Turing-Universal Learners with Optimal Scaling Laws

Preetum Nakkiran
For a given distribution, learning algorithm, and performance metric, the rate of convergence (or data-scaling law) is the asymptotic behavior of the algorithm's test performance as a function of the number of training samples. Many learning methods in both theory and practice have power-law rates, i.e. performance scales as $n^{-\alpha}$ for some $\alpha > 0$. Moreover, both theoreticians and practitioners are concerned with improving the rates of their learning algorithms under settings of interest. We observe the…
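As a rough illustration of what a power-law rate means in practice, the sketch below estimates the exponent α from a learning curve by ordinary least squares in log-log space. It is not taken from the paper: the function name `fit_power_law`, the constant `c`, and the synthetic data are assumptions made for this example, and the fit presumes the error follows c · n^(-α) with no irreducible error floor.

```python
import math

def fit_power_law(ns, errors):
    """Fit error ~ c * n**(-alpha) by least squares on log(error) vs. log(n).

    Hypothetical helper for illustration; returns (alpha, c).
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(e) for e in errors]
    k = len(xs)
    mean_x = sum(xs) / k
    mean_y = sum(ys) / k
    # Slope of the regression line in log-log space.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    alpha = -slope                           # the rate is the negated slope
    c = math.exp(mean_y - slope * mean_x)    # the intercept gives the constant
    return alpha, c

# Synthetic learning curve (assumed data): error = 2 * n^(-0.5).
ns = [100, 200, 400, 800, 1600]
errors = [2.0 * n ** -0.5 for n in ns]
alpha, c = fit_power_law(ns, errors)
print(f"estimated alpha = {alpha:.3f}, c = {c:.3f}")
```

On measured learning curves the points are noisy and often flatten toward a nonzero error floor, so a straight-line fit like this is only a first approximation of the asymptotic rate.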


A theory of universal learning
There are only three possible rates of universal learning, which aims to understand the performance of learning algorithms on every data distribution, but without requiring uniformity over the distribution: exponential, linear, or arbitrarily slow rates.
Statistical Theory of Learning Curves under Entropic Loss Criterion
A universal property of learning curves is elucidated, which shows how the generalization error, training error, and the complexity of the underlying stochastic machine are related and how the behavior of a stochastic machine is improved as the number of training examples increases.
Learning Curve Theory
This work develops and theoretically analyses the simplest possible (toy) model that can exhibit $n^{-\beta}$ learning curves for arbitrary power $\beta > 0$, and determines whether power laws are universal or depend on the data distribution.
Complexity-based induction systems: Comparisons and convergence theorems
  • R. Solomonoff
  • Mathematics, Computer Science
    IEEE Trans. Inf. Theory
  • 1978
Levin has shown that if $\tilde{P}'_{M}(x)$ is an unnormalized form of this measure, and $P(x)$ is any computable probability measure on strings $x$, then $\tilde{P}'_{M}(x) \geq C\,P(x)$, where $C$ is a constant independent of $x$.
The Shape of Learning Curves: a Review
This review recounts the origins of the term, provides a formal definition of the learning curve, and provides a comprehensive overview of the literature regarding the shape of learning curves.
Asymptotic learning curves of kernel methods: empirical data v.s. Teacher-Student paradigm
The results quantify how smooth Gaussian data should be to avoid the curse of dimensionality, and indicate that for kernel learning the relevant dimension of the data is defined in terms of how the distance between nearest data points depends on $n$.
Any Discrimination Rule Can Have an Arbitrarily Bad Probability of Error for Finite Sample Size
  • L. Devroye
  • Mathematics, Medicine
    IEEE Transactions on Pattern Analysis and Machine Intelligence
  • 1982
Any attempt to find a nontrivial distribution-free upper bound for $R_n$ will fail, and any results on the rate of convergence of $R_n$ to $R^*$ must use assumptions about the distribution of (X, Y).
Learning from examples in large neural networks.
Numerical results on training in layered neural networks indicate that the generalization error improves gradually in some cases, and sharply in others, and statistical mechanics is used to study generalization curves in large layered networks.
Understanding Machine Learning - From Theory to Algorithms
The aim of this textbook is to introduce machine learning, and the algorithmic paradigms it offers, in a principled way in an advanced undergraduate or beginning graduate course.
Deep Learning Scaling is Predictable, Empirically
A large-scale empirical characterization of generalization error and model size growth as training sets grow is presented, and it is shown that model size scales sublinearly with data size.