# Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers

```bibtex
@article{Wei2021StatisticallyMA,
  title   = {Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers},
  author  = {Colin Wei and Yining Chen and Tengyu Ma},
  journal = {ArXiv},
  year    = {2021},
  volume  = {abs/2107.13163}
}
```

A common lens to theoretically study neural net architectures is to analyze the functions they can approximate. However, constructions from approximation theory may be unrealistic and therefore less meaningful. For example, a common unrealistic trick is to encode target function values using infinite precision. To address these issues, this work proposes a formal definition of statistically meaningful (SM) approximation which requires the approximating network to exhibit good statistical…
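
To see why the infinite-precision trick is unrealistic, here is a toy Python sketch (invented for illustration, not the paper's construction; `encode` and `decode` are hypothetical helpers) that packs an entire truth table into the binary digits of a single weight:

```python
# Toy version of the "infinite precision" trick: store the whole truth
# table of f: {0,1}^n -> {0,1} in the binary digits of a single weight w,
# then "compute" f by reading those digits back out.

def encode(table):
    # w = sum_i table[i] * 2^-(i+1); digit i of w stores f on the i-th input
    return sum(bit * 2.0 ** -(i + 1) for i, bit in enumerate(table))

def decode(w, i):
    # recover the i-th binary digit of w
    return int(w * 2 ** (i + 1)) % 2

table = [0, 1, 1, 0, 1, 0, 0, 1]           # f = parity on {0,1}^3
w = encode(table)
print([decode(w, i) for i in range(8)])    # [0, 1, 1, 0, 1, 0, 0, 1]
```

The scheme needs one bit of precision per table entry, so under float64 (about 52 mantissa bits) it already fails once the table exceeds roughly 50 entries; SM approximation is designed to rule out exactly this kind of construction.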

## 2 Citations

### Recurrent Convolutional Neural Networks Learn Succinct Learning Algorithms

- Computer Science
- 2022

This work exhibits an NN architecture that, in polynomial time, learns as well as any efficient learning algorithm admitting a constant-size description, suggesting that the synergy of recurrent and convolutional NNs may be more powerful than either alone.

### Inductive Biases and Variable Creation in Self-Attention Mechanisms

- Computer Science
- ICML
- 2022

The main result shows that bounded-norm Transformer networks “create sparse variables”: a single self-attention head can represent a sparse function of the input sequence, with sample complexity scaling only logarithmically with the context length.
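
A minimal numpy sketch of this "sparse variable" behavior (the one-hot positional keys, the set `S`, and the temperature `beta` are assumptions of the sketch, not the paper's construction):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

T, d = 64, 8                 # context length, token dimension
S = [3, 17, 42]              # the few positions the function depends on
beta = 20.0                  # sharp softmax concentrates mass on S

pos = np.eye(T)              # one-hot positional encodings used as keys
q = beta * pos[S].sum(0)     # a query aligned with exactly the keys in S
attn = softmax(pos @ q)      # ~1/|S| on each i in S, ~0 elsewhere

x = np.random.randn(T, d)    # token values
out = attn @ x               # head output depends (almost) only on x[S]
print(np.allclose(out, x[S].mean(0), atol=1e-3))  # True
```

Since the output is, up to exponentially small leakage, a function of the three selected tokens alone, its complexity need not grow with the context length T beyond the cost of addressing positions, which is the intuition behind the logarithmic sample-complexity scaling.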

## References

Showing 1-10 of 57 references.

### Neural Networks with Small Weights and Depth-Separation Barriers

- Computer Science
- Electron. Colloquium Comput. Complex.
- 2020

This paper provides a negative, constructive answer to the question of whether there are polynomially-bounded functions that require super-polynomial weights in order to be approximated by constant-depth neural networks, and it proves fundamental barriers to extending such results beyond depth $4$, by reduction to open problems and natural-proofs barriers in circuit complexity.

### On the Computational Power of Transformers and Its Implications in Sequence Modeling

- Computer Science
- CoNLL
- 2020

This paper provides an alternative and simpler proof that vanilla Transformers are Turing-complete, and further proves that Transformers with only positional masking and without any positional encoding are also Turing-complete.

### The Connection Between Approximation, Depth Separation and Learnability in Neural Networks

- Computer Science
- COLT
- 2021

It is shown that a necessary condition for a function to be learnable by gradient descent on deep neural networks is that the function can be approximated, at least in a weak sense, by shallow neural networks.

### RNNs Can Generate Bounded Hierarchical Languages with Optimal Memory

- Computer Science
- EMNLP
- 2020

Dyck-$(k,m)$ is introduced: the language of well-nested brackets of $k$ types with nesting depth at most $m$, reflecting the bounded memory needs and long-distance dependencies of natural-language syntax. It is proved by explicit construction that an RNN with $O(m \log k)$ hidden units suffices, an exponential reduction in memory.
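
For concreteness, a minimal Python membership test for Dyck-$(k,m)$ (the integer bracket encoding is invented for this sketch): the stack never exceeds $m$ entries, each naming one of $k$ bracket types, i.e. roughly $m \log_2 k$ bits of state, the memory budget the paper's RNN construction matches:

```python
def in_dyck_km(seq, k, m):
    """Dyck-(k, m): well-nested brackets of k types, nesting depth <= m.
    Sketch encoding: opening bracket of type t in {1..k} is +t, closer is -t."""
    stack = []                  # at most m entries, each in {1..k}
    for sym in seq:
        if sym > 0:             # opening bracket
            if len(stack) == m:
                return False    # depth bound exceeded
            stack.append(sym)
        else:                   # closing bracket
            if not stack or stack[-1] != -sym:
                return False    # wrong type, or nothing to close
            stack.pop()
    return not stack            # all brackets must be closed

# With k=2, m=2 and types 1='(' and 2='[':
print(in_dyck_km([1, 2, -2, -1], k=2, m=2))  # True:  "([])"
print(in_dyck_km([1, 2, -1, -2], k=2, m=2))  # False: "([)]" crosses
```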

### On the Computational Power of Neural Nets

- Computer Science
- J. Comput. Syst. Sci.
- 1995

It is proved that all Turing machines can be simulated in linear time by nets with rational weights; in particular, there is a net of about 1,000 processors that computes a universal partial-recursive function.
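
The key device behind such simulations is storing an unbounded binary stack in a single rational activation. A minimal sketch of the standard base-4 encoding (function names are mine; the actual construction wires these maps into the net's dynamics):

```python
from fractions import Fraction

def push(s, bit):
    # write digit (2*bit + 1) in base 4: bit 0 -> digit 1, bit 1 -> digit 3;
    # codes stay in a Cantor-like set, so the top bit is easy to read off
    return s / 4 + Fraction(2 * bit + 1, 4)

def top(s):
    return 1 if s >= Fraction(1, 2) else 0   # leading digit 3 vs. 1

def pop(s):
    return 4 * s - (2 * top(s) + 1)          # shift the digit back out

s = Fraction(0)                              # empty stack
for b in [1, 0, 1]:
    s = push(s, b)
print(top(s))   # 1, the last bit pushed
s = pop(s)
print(top(s))   # 0
```

Because push, top, and pop are affine maps plus a threshold, a few saturated-linear units can compute them, which is how rational weights buy Turing completeness.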

### On the Turing Completeness of Modern Neural Network Architectures

- Computer Science
- ICLR
- 2019

This work studies the computational power of two of the most paradigmatic modern architectures, the Transformer and the Neural GPU, and shows both models to be Turing complete based solely on their capacity to compute and access internal dense representations of the data.

### Depth-Width Trade-offs for ReLU Networks via Sharkovsky's Theorem

- Computer Science
- ICLR
- 2020

A new connection is drawn between DNN expressivity and Sharkovsky's theorem from dynamical systems, which makes it possible to characterize the depth-width trade-offs of ReLU networks for representing functions based on the presence of a generalized notion of fixed points, called periodic points.

### Computational Capabilities of Graph Neural Networks

- Computer Science, Mathematics
- IEEE Transactions on Neural Networks
- 2009

The functions that can be approximated by GNNs in probability, up to any prescribed degree of precision, are characterized; this class includes most of the practically useful functions on graphs.

### Vapnik-Chervonenkis Dimension of Recurrent Neural Networks

- Computer Science
- Discret. Appl. Math.
- 1998

### On the Expressive Power of Deep Learning: A Tensor Analysis

- Computer Science
- COLT 2016
- 2015

It is proved that, apart from a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network.