Corpus ID: 236469455

Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers

@article{Wei2021StatisticallyMA,
  title={Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers},
  author={Colin Wei and Yining Chen and Tengyu Ma},
  journal={ArXiv},
  year={2021},
  volume={abs/2107.13163}
}
A common lens to theoretically study neural net architectures is to analyze the functions they can approximate. However, constructions from approximation theory may be unrealistic and therefore less meaningful. For example, a common unrealistic trick is to encode target function values using infinite precision. To address these issues, this work proposes a formal definition of statistically meaningful (SM) approximation which requires the approximating network to exhibit good statistical… 
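
As a rough reading of the abstract (a sketch, not the paper's exact definition), SM approximation couples approximation with learnability: the approximating network class must not only contain a good approximator but also admit a learning rule with polynomial sample complexity. One hedged formalization might look like:

\[
\exists\, A \;\; \forall\, \varepsilon,\delta\in(0,1):\quad
n \ge \mathrm{poly}\!\left(\tfrac{1}{\varepsilon},\,\log\tfrac{1}{\delta}\right)
\;\Longrightarrow\;
\Pr_{S\sim D^{n}}\!\left[\;\mathbb{E}_{x\sim D}\,\ell\!\big(A(S)(x),\,f(x)\big)\le\varepsilon\;\right]\ge 1-\delta,
\]

where $A$ is a learning algorithm over the approximating class, $S$ a training sample of size $n$, $f$ the target function, and $\ell$ a bounded loss. The sample-complexity requirement is the key departure from plain approximation theory; it rules out constructions that rely on tricks such as infinite-precision encodings.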

Vision Transformers provably learn spatial structure

This paper proposes a spatially structured dataset and a simplified ViT model, describes a mechanism by which the model implicitly learns the spatial structure of the dataset while generalizing, and proves that patch association helps the model transfer sample-efficiently to downstream datasets that share the same structure as the pre-training one but differ in the features.

Same Pre-training Loss, Better Downstream: Implicit Bias Matters for Language Models

It is proved that SGD with standard mini-batch noise implicitly prefers flatter minima in language models, and a strong correlation is empirically observed between flatness and downstream performance among models with the same minimal pre-training loss.

Inductive Biases and Variable Creation in Self-Attention Mechanisms

The main result shows that bounded-norm Transformer networks “create sparse variables”: a single self-attention head can represent a sparse function of the input sequence, with sample complexity scaling only logarithmically with the context length.
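
As a toy illustration of the “sparse variable” intuition (this is not the paper's construction; all sizes and names below are made up), a single softmax attention head with a bounded-norm query that probes a “relevance” coordinate concentrates its attention on a few positions, so its output depends on a sparse subset of the sequence regardless of context length:

```python
# Toy illustration (not the paper's construction): a single softmax attention
# head whose fixed query probes a "relevance" coordinate, so the head's output
# depends on a sparse subset of positions, independent of the context length.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 64, 8

# Token embeddings; coordinate 0 flags the (few) relevant tokens.
X = rng.normal(size=(seq_len, d))
X[:, 0] = 0.0
X[[3, 17], 0] = 1.0                      # only two positions are relevant

beta = 40.0                              # bounded but large enough for near-hard attention
q = np.zeros(d); q[0] = beta             # query vector that probes the flag coordinate

scores = X @ q / np.sqrt(d)              # one score per key position
attn = np.exp(scores - scores.max())
attn /= attn.sum()
head_output = attn @ X                   # value projection taken as the identity for simplicity

print(round(attn[[3, 17]].sum(), 4))     # attention mass on the two relevant tokens (~1.0)
```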

What learning algorithm is in-context learning? Investigations with linear models

This work investigates the hypothesis that transformer-based in-context learners implicitly implement standard learning algorithms by encoding smaller models in their activations and updating these implicit models as new examples appear in the context, suggesting that in-context learning is understandable in algorithmic terms and that (at least in the linear case) such learners may rediscover standard estimation algorithms.
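
For concreteness, the “standard learning algorithm” side of this hypothesis in the linear setting is simply least-squares or ridge regression fit on the in-context examples. The sketch below (illustrative only, not the paper's code; all names are made up) shows the explicit estimator that a transformer's in-context predictions would be compared against:

```python
# Minimal sketch (illustrative only): the explicit learner referenced by the
# in-context hypothesis in the linear case; fit ridge regression on the context
# examples, then predict at the query point.
import numpy as np

def ridge_in_context_prediction(context_x, context_y, query_x, lam=1e-3):
    """Fit ridge regression on (context_x, context_y) and predict at query_x."""
    d = context_x.shape[1]
    A = context_x.T @ context_x + lam * np.eye(d)
    w = np.linalg.solve(A, context_x.T @ context_y)
    return query_x @ w

rng = np.random.default_rng(0)
d, n_context = 4, 16
w_true = rng.normal(size=d)

context_x = rng.normal(size=(n_context, d))
context_y = context_x @ w_true          # noiseless linear targets
query_x = rng.normal(size=d)

pred = ridge_in_context_prediction(context_x, context_y, query_x)
print(pred, query_x @ w_true)           # the two values should be close
```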

Recurrent Convolutional Neural Networks Learn Succinct Learning Algorithms

This work exhibits an NN architecture that, in polynomial time, learns as well as any efficient learning algorithm describable by a constant-sized learning algorithm, suggesting that the synergy of Recurrent and Convolutional NNs may be more powerful than either alone.

References

Showing 1–10 of 57 references

Neural Networks with Small Weights and Depth-Separation Barriers

This paper provides a negative and constructive answer to the question of whether there are polynomially-bounded functions that require super-polynomial weights in order to be approximated by constant-depth neural networks, and proves fundamental barriers to proving such results beyond depth $4$ by reduction to open problems and natural-proof barriers in circuit complexity.

The Connection Between Approximation, Depth Separation and Learnability in Neural Networks

It is shown that a necessary condition for a function to be learnable by gradient descent on deep neural networks is that the function can be approximated, at least in a weak sense, by shallow neural networks.

RNNs Can Generate Bounded Hierarchical Languages with Optimal Memory

Dyck-$(k, m)$ is introduced, the language of well-nested brackets of $k$ types with nesting depth at most $m$, reflecting the bounded memory needs and long-distance dependencies of natural language syntax, and it is proved that an RNN with $O(m \log k)$ hidden units suffices, an exponential reduction in memory, by an explicit construction.
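
A hedged illustration of where the $O(m \log k)$ figure comes from (this is a plain recognizer, not the paper's RNN construction): deciding membership in Dyck-$(k, m)$ only requires a stack of at most $m$ bracket types, i.e. roughly $m \lceil \log_2 k \rceil$ bits of state:

```python
# Illustrative only (not the paper's RNN construction): recognizing Dyck-(k, m)
# needs just a stack of at most m bracket types, i.e. roughly m * ceil(log2 k)
# bits of state, which is the memory budget the O(m log k)-unit RNN mirrors.
def is_dyck_k_m(tokens, k, m):
    """tokens: sequence of (bracket_type, is_open) pairs with 0 <= bracket_type < k."""
    stack = []                           # holds at most m bracket types
    for bracket_type, is_open in tokens:
        if is_open:
            if len(stack) == m:          # nesting depth exceeded
                return False
            stack.append(bracket_type)
        else:
            if not stack or stack.pop() != bracket_type:
                return False             # mismatched or unopened bracket
    return not stack                     # every opened bracket must be closed

# "( [ ] )" with k=2 bracket types, depth bound m=2
example = [(0, True), (1, True), (1, False), (0, False)]
print(is_dyck_k_m(example, k=2, m=2))    # True
print(is_dyck_k_m(example, k=2, m=1))    # False: nesting depth 2 exceeds m=1
```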

On the Computational Power of Neural Nets

It is proved that one may simulate all Turing Machines by rational nets in linear time, and there is a net made up of about 1,000 processors which computes a universal partial-recursive function.

On the Turing Completeness of Modern Neural Network Architectures

This work studies the computational power of two of the most paradigmatic architectures exemplifying these mechanisms, the Transformer and the Neural GPU, and shows that both models are Turing complete based exclusively on their capacity to compute and access internal dense representations of the data.

Depth-Width Trade-offs for ReLU Networks via Sharkovsky's Theorem

This paper points to a new connection between the expressivity of DNNs and Sharkovsky's Theorem from dynamical systems, which enables a characterization of the depth-width trade-offs of ReLU networks for representing functions based on the presence of a generalized notion of fixed points, called periodic points.

Computational Capabilities of Graph Neural Networks

The functions that can be approximated by GNNs, in probability, up to any prescribed degree of precision are described, and this class includes most of the practically useful functions on graphs.

Vapnik-Chervonenkis Dimension of Recurrent Neural Networks

On the Expressive Power of Deep Learning: A Tensor Analysis

It is proved that, besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network.

Neural tangent kernels, transportation mappings, and universal approximation

This paper establishes rates of universal approximation for the shallow neural tangent kernel (NTK): network weights are only allowed microscopic changes from random initialization, which entails…
...