Corpus ID: 216080945

A Neural Scaling Law from the Dimension of the Data Manifold

@article{Sharma2020ANS,
  title={A Neural Scaling Law from the Dimension of the Data Manifold},
  author={Utkarsh Sharma and Jared Kaplan},
  journal={ArXiv},
  year={2020},
  volume={abs/2004.10802}
}
When data is plentiful, the loss achieved by well-trained neural networks scales as a power-law $L \propto N^{-\alpha}$ in the number of network parameters $N$. This empirical scaling law holds for a wide variety of data modalities, and may persist over many orders of magnitude. The scaling law can be explained if neural models are effectively just performing regression on a data manifold of intrinsic dimension $d$. This simple theory predicts that the scaling exponents $\alpha \approx 4/d$ for…
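As a rough illustration of the relation above, the exponent $\alpha$ can be read off a log-log fit of loss against parameter count, and the prediction $\alpha \approx 4/d$ then gives an estimate of the manifold dimension. The sketch below assumes NumPy is available; the parameter counts and losses are placeholder values, not measurements from the paper.

```python
# Minimal sketch: fit L ∝ N^{-alpha} in log-log space, then use the
# paper's prediction alpha ≈ 4/d to back out an intrinsic dimension d.
import numpy as np

# Placeholder (parameter count, test loss) pairs from a hypothetical model-size scan.
N = np.array([1e4, 3e4, 1e5, 3e5, 1e6])
L = np.array([1.20, 0.95, 0.74, 0.59, 0.46])

# Linear fit: log L = log c - alpha * log N.
slope, _ = np.polyfit(np.log(N), np.log(L), 1)
alpha = -slope

print(f"fitted exponent alpha ≈ {alpha:.3f}")
print(f"implied intrinsic dimension d ≈ 4/alpha ≈ {4 / alpha:.1f}")
```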
Explaining Neural Scaling Laws
TLDR
This work identifies variance-limited and resolution-limited scaling behavior for both dataset and model size, yielding four related scaling regimes with respect to the number of model parameters P and the dataset size D.
Scaling Laws for Autoregressive Generative Modeling
TLDR
Empirical scaling laws for the cross-entropy loss are identified, strengthening the case that scaling laws have important implications for neural network performance, including on downstream tasks.
Scaling Laws for Transfer
TLDR
This work finds that pre-training effectively multiplies the fine-tuning dataset size, and suggests that the exponents in these power laws correspond to measures of the generality of a model and the proximity of distributions (in a directed rather than symmetric sense).
Learning Curve Theory
TLDR
This work develops and theoretically analyses the simplest possible (toy) model that can exhibit $n^{-\beta}$ learning curves for arbitrary power $\beta > 0$, and determines whether power laws are universal or depend on the data distribution.
A Scaling Law for Synthetic-to-Real Transfer: A Measure of Pre-Training
TLDR
A simple and general scaling law is observed that consistently describes learning curves in various tasks, models, and complexities of synthesized pre-training data.
Distributional Generalization: A New Kind of Generalization
We introduce a new notion of generalization -- Distributional Generalization -- which roughly states that outputs of a classifier at train and test time are close *as distributions*, as opposed to…
Topological Obstructions to Autoencoding
TLDR
The analysis is grounded in the discussion of a mock “bump hunt,” in which the autoencoder fails to identify an anomalous “signal” for reasons tied to the intrinsic topology of n-particle phase space.
Limits to Depth Efficiencies of Self-Attention
TLDR
By identifying network width as a limiting factor, the analysis indicates that solutions for dramatically increasing the width can facilitate the next leap in self-attention expressivity.
MassGenie: a transformer-based deep learning method for identifying small molecules from their mass spectra
TLDR
The ability to create and to ‘learn’ millions of fragmentation patterns in silico, and to generate candidate structures that are very close or identical to those of the ‘true’ molecules, directly opens up the entire field of de novo small-molecule structure prediction from experimental mass spectra.
Towards Continual Reinforcement Learning: A Review and Perspectives
TLDR
A taxonomy of different continual RL formulations is provided and the non-stationary dynamics of each setting are mathematically characterized, along with an overview of benchmarks used in the literature and important metrics for understanding agent performance.

References

Asymptotic learning curves of kernel methods: empirical data v.s. Teacher-Student paradigm
TLDR
The results quantify how smooth Gaussian data should be to avoid the curse of dimensionality, and indicate that for kernel learning the relevant dimension of the data is defined in terms of how the distance between nearest data points depends on $n$.
Intrinsic dimension of data representations in deep neural networks
TLDR
The intrinsic dimensionality of data representations is studied, i.e. the minimal number of parameters needed to describe a representation, and it is found that, in a trained network, the ID is orders of magnitude smaller than the number of units in each layer.
A Constructive Prediction of the Generalization Error Across Scales
TLDR
This work presents a functional form which approximates well the generalization error in practice, and shows that the form both fits the observations well across scales, and provides accurate predictions from small- to large-scale models and data.
Scaling Laws for Neural Language Models
TLDR
Larger models are significantly more sample-efficient, such that optimally compute-efficient training involves training very large models on a relatively modest amount of data and stopping significantly before convergence.
Measuring the Intrinsic Dimension of Objective Landscapes
TLDR
Intrinsic dimension allows some quantitative comparison of problem difficulty across supervised, reinforcement, and other types of learning; it is concluded that solving the inverted pendulum problem is 100 times easier than classifying digits from MNIST, and that playing Atari Pong from pixels is about as hard as classifying CIFAR-10.
Learning Multiple Layers of Features from Tiny Images
TLDR
It is shown how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex, using a novel parallelization algorithm to distribute the work among multiple machines connected on a network.
Estimating the intrinsic dimension of datasets by a minimal neighborhood information
TLDR
A new ID estimator is proposed that uses only the distances to the first and second nearest neighbor of each point in the sample, which reduces the effects of curvature and of density variation, as well as the resulting computational cost.
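As a rough sketch of the two-nearest-neighbor idea summarized above: for each point, the ratio of the second to the first neighbor distance follows a Pareto law whose exponent is the intrinsic dimension. The code below uses a simple maximum-likelihood form of that estimate rather than the fitting procedure of the original paper, assumes scikit-learn is available, and the synthetic dataset is illustrative only.

```python
# Sketch of a two-nearest-neighbour intrinsic-dimension estimate: only the
# first and second neighbour distances of each point are used, and the
# Pareto exponent of their ratio is estimated by maximum likelihood.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def two_nn_dimension(X: np.ndarray) -> float:
    # Columns: distance to self (0), first and second nearest neighbours.
    dists, _ = NearestNeighbors(n_neighbors=3).fit(X).kneighbors(X)
    mu = dists[:, 2] / dists[:, 1]        # ratio r2 / r1 for each point
    return len(X) / np.sum(np.log(mu))    # MLE of the Pareto exponent = ID

# Illustrative check: a 2-D linear manifold embedded in 10-D ambient space.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2)) @ rng.normal(size=(2, 10))
print(f"estimated intrinsic dimension ≈ {two_nn_dimension(X):.2f}")
```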
Adam: A Method for Stochastic Optimization
TLDR
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
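For concreteness, a compact sketch of the update rule summarized above (exponentially decaying estimates of the first and second gradient moments with bias correction), written with the commonly cited default hyperparameters; this is an illustrative re-implementation, not the authors' code.

```python
# Sketch of one Adam update: moving averages of the gradient (m) and its
# square (v), bias-corrected, then a per-coordinate scaled step.
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy usage: minimise f(theta) = ||theta||^2, so grad = 2 * theta.
theta = np.ones(5)
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t)
print(theta)  # close to zero after enough steps
```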
Efficient Representation of Low-Dimensional Manifolds using Deep Networks
TLDR
It is shown that the first two layers of a deep network can exactly embed points lying on a monotonic chain, a special type of piecewise linear manifold, mapping them to a low-dimensional Euclidean space.
EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks
TLDR
A new scaling method is proposed that uniformly scales all dimensions of depth/width/resolution using a simple yet highly effective compound coefficient; its effectiveness is demonstrated by scaling up MobileNets and ResNet.