Corpus ID: 236469455

Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers

@article{Wei2021StatisticallyMA,
  title={Statistically Meaningful Approximation: a Case Study on Approximating Turing Machines with Transformers},
  author={Colin Wei and Yining Chen and Tengyu Ma},
  journal={ArXiv},
  year={2021},
  volume={abs/2107.13163}
}
A common lens to theoretically study neural net architectures is to analyze the functions they can approximate. However, constructions from approximation theory may be unrealistic and therefore less meaningful. For example, a common unrealistic trick is to encode target function values using infinite precision. To address these issues, this work proposes a formal definition of statistically meaningful (SM) approximation which requires the approximating network to exhibit good statistical… 
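
A rough way to read the "statistically meaningful" requirement (a paraphrased sketch in my own notation, not the paper's exact definition): the approximating class should not only contain a good approximation of the target but also come with a generalization guarantee, for instance

  \exists f \in \mathcal{F}: \quad \mathbb{E}_{x \sim P}\big[\ell(f(x), g(x))\big] \le \epsilon
  \qquad \text{and} \qquad
  \widehat{\mathfrak{R}}_n(\mathcal{F}) \le \frac{\mathrm{poly}(d)}{\sqrt{n}},

so that empirical risk minimization over \mathcal{F} provably finds a low-loss network from polynomially many samples, rather than relying on infinite-precision constructions that no learner could recover.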

Recurrent Convolutional Neural Networks Learn Succinct Learning Algorithms

TLDR
This work exhibits an NN architecture that, in polynomial time, learns as well as any efficient learning algorithm describable by a constant-sized learning algorithm, suggesting that the synergy of Recurrent and Convolutional NNs may be more powerful than either alone.

Inductive Biases and Variable Creation in Self-Attention Mechanisms

TLDR
The main result shows that bounded-norm Transformer networks “create sparse variables”: a single self-attention head can represent a sparse function of the input sequence, with sample complexity scaling only logarithmically with the context length.

References

SHOWING 1-10 OF 57 REFERENCES

Neural Networks with Small Weights and Depth-Separation Barriers

TLDR
This paper provides a negative and constructive answer to the question of whether there are polynomially-bounded functions which require super-polynomial weights in order to be approximated by constant-depth neural networks, and proves fundamental barriers to proving such results beyond depth $4$ by reduction to open problems and natural-proof barriers in circuit complexity.

On the Computational Power of Transformers and Its Implications in Sequence Modeling

TLDR
This paper provides an alternate and simpler proof that vanilla Transformers are Turing-complete, and proves that Transformers with only positional masking and without any positional encoding are also Turing-complete.

The Connection Between Approximation, Depth Separation and Learnability in Neural Networks

TLDR
It is shown that a necessary condition for a function to be learnable by gradient descent on deep neural networks is to be able to approximate the function, at least in a weak sense, with shallow neural networks.

RNNs Can Generate Bounded Hierarchical Languages with Optimal Memory

TLDR
Dyck-$(k,m)$ is introduced, the language of well-nested brackets of $k$ types with nesting depth at most $m$, reflecting the bounded memory needs and long-distance dependencies of natural language syntax, and it is proved that an RNN with $O(m \log k)$ hidden units suffices, an exponential reduction in memory, by an explicit construction.
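
To make the memory bound concrete, here is a plain reference recognizer for Dyck-$(k,m)$ (an illustrative Python sketch, not the paper's RNN construction): deciding membership needs only a stack of depth at most $m$ whose entries each name one of $k$ bracket types, i.e. $O(m \log k)$ bits of state, which is the budget the paper shows an RNN can match.

def is_dyck_k_m(tokens, k, m):
    """Decide membership in Dyck-(k, m): well-nested brackets of k types
    with nesting depth at most m. Token +t opens bracket type t, -t closes it."""
    stack = []                      # open bracket types; depth never exceeds m
    for tok in tokens:
        t = abs(tok)
        if not 1 <= t <= k:
            return False
        if tok > 0:                 # opening bracket
            if len(stack) == m:     # would exceed the allowed nesting depth
                return False
            stack.append(t)
        else:                       # closing bracket must match the top of the stack
            if not stack or stack[-1] != t:
                return False
            stack.pop()
    return not stack                # every bracket must be closed

# Example: "( [ ] )" with k = 2 ('(' = 1, '[' = 2) and depth bound m = 2
assert is_dyck_k_m([1, 2, -2, -1], k=2, m=2)
assert not is_dyck_k_m([1, 2, -1, -2], k=2, m=2)   # crossed brackets are rejected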

On the Computational Power of Neural Nets

TLDR
It is proved that one may simulate all Turing Machines by rational nets in linear time, and there is a net made up of about 1,000 processors which computes a universal partial-recursive function.
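
The heart of such simulations is that an unbounded binary stack can be stored in a single rational-valued activation and manipulated by affine updates. The sketch below (my own Python illustration of the standard base-4 encoding used in this line of work, not the paper's exact network) shows push, top, and pop as simple arithmetic that a rational-weight unit could perform.

from fractions import Fraction

# Encode a binary stack (top of stack first) as a rational in [0, 1):
# value = sum_i (2 * w_i + 1) / 4^i, i.e. digits 1 and 3 in base 4.
def encode(bits):
    val = Fraction(0)
    for b in reversed(bits):        # build the encoding from the bottom of the stack up
        val = (val + 2 * b + 1) / 4
    return val

def push(val, b):
    return (val + 2 * b + 1) / 4    # a single affine update

def top(val):
    return (int(4 * val) - 1) // 2  # leading base-4 digit reveals the top bit (stack assumed non-empty)

def pop(val):
    return 4 * val - (2 * top(val) + 1)

s = encode([1, 0, 1])               # stack with 1 on top
assert top(s) == 1
assert pop(s) == encode([0, 1])     # popping recovers the encoding of the rest
assert push(pop(s), 1) == s         # push undoes pop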

On the Turing Completeness of Modern Neural Network Architectures

TLDR
This work studies the computational power of two of the most paradigmatic modern architectures, the Transformer and the Neural GPU, and shows both models to be Turing complete based solely on their capacity to compute and access internal dense representations of the data.

Depth-Width Trade-offs for ReLU Networks via Sharkovsky's Theorem

TLDR
A new connection is drawn between DNN expressivity and Sharkovsky's Theorem from dynamical systems, which makes it possible to characterize the depth-width trade-offs of ReLU networks for representing functions based on the presence of a generalized notion of fixed points, called periodic points.
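
As a toy illustration of the underlying mechanism (my own example, in the spirit of oscillation-counting arguments for depth-width separations; the paper's actual route goes through periodic points and Sharkovsky's theorem): composing a simple piecewise-linear map with itself k times produces on the order of 2^k oscillations, which deep-but-narrow ReLU networks reproduce cheaply while shallow ones cannot.

def tent(x):
    """Tent map on [0, 1]; exactly representable by a tiny two-layer ReLU network."""
    return 2 * x if x <= 0.5 else 2 * (1 - x)

def iterate(f, x, k):
    for _ in range(k):
        x = f(x)
    return x

def count_crossings(f, k, grid=9973):
    """Count sign changes of x -> f^k(x) - 1/2 on a fine grid (prime size, so
    no grid point lands exactly on a crossing); a proxy for the number of linear pieces."""
    xs = [i / grid for i in range(grid + 1)]
    ys = [iterate(f, x, k) - 0.5 for x in xs]
    return sum(1 for a, b in zip(ys, ys[1:]) if a * b < 0)

# The k-fold composition crosses 1/2 about 2^k times: easy for depth ~k,
# but requires exponentially many units at any fixed small depth.
for k in range(1, 6):
    print(k, count_crossings(tent, k))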

Computational Capabilities of Graph Neural Networks

TLDR
The functions that can be approximated by GNNs, in probability, up to any prescribed degree of precision are characterized, and this class is shown to include most of the practically useful functions on graphs.

Vapnik-Chervonenkis Dimension of Recurrent Neural Networks

On the Expressive Power of Deep Learning: A Tensor Analysis

TLDR
It is proved that, besides a negligible set, all functions that can be implemented by a deep network of polynomial size require exponential size in order to be realized (or even approximated) by a shallow network.
...