Corpus ID: 235421630

Thinking Like Transformers

Gail Weiss, Yoav Goldberg, Eran Yahav
What is the computational model behind a Transformer? Where recurrent neural networks have direct parallels in finite state machines, allowing clear discussion and thought around architecture variants or trained models, Transformers have no such familiar parallel. In this paper we aim to change that, proposing a computational model for the transformer-encoder in the form of a programming language. We map the basic components of a transformer-encoder (attention and feed-forward computation) into…
On the Power of Saturated Transformers: A View from Circuit Complexity
This work analyzes the circuit complexity of transformers with saturated attention, a generalization of hard attention that more closely captures the attention patterns learnable in practical transformers, and shows that saturated transformers transcend the limitations of hard-attention transformers.
The Neural Data Router: Adaptive Control Flow in Transformers Improves Systematic Generalization
This novel Neural Data Router (NDR) achieves 100% length generalization accuracy on the classic compositional table lookup task, as well as near-perfect accuracy on the simple arithmetic task and a new variant of ListOps testing for generalization across computational depth.


Universal Transformers
The Universal Transformer (UT), a parallel-in-time self-attentive recurrent sequence model which can be cast as a generalization of the Transformer model and which addresses issues of parallelizability and global receptive field, is proposed.
Attention is All you Need
A new, simple network architecture, the Transformer, is proposed. It is based solely on attention mechanisms, dispensing with recurrence and convolutions entirely, and it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Can Recurrent Neural Networks Learn Nested Recursion?
This paper investigates experimentally the capability of several recurrent neural networks (RNNs) to learn nested recursion, and measures an upper bound of their capability to do so, by simplifying the task to learning a generalized Dyck language, namely one composed of matching parentheses of various kinds.
Efficient Transformers: A Survey
This paper characterizes a large and thoughtful selection of recent efficiency-flavored "X-former" models, providing an organized and comprehensive overview of existing work and models across multiple domains.
Improving Transformer Models by Reordering their Sublayers
This work proposes a new transformer pattern that adheres to this property, the sandwich transformer, and shows that it improves perplexity on multiple word-level and character-level language modeling benchmarks, at no cost in parameters, memory, or training time.
ETC: Encoding Long and Structured Data in Transformers
A new family of Transformer models, the Extended Transformer Construction (ETC), is presented; it allows for significant increases in input sequence length by introducing a new global-local attention mechanism between a global memory and the standard input tokens.
Theoretical Limitations of Self-Attention in Neural Sequence Models
Across both soft and hard attention, strong theoretical limitations are shown of the computational abilities of self-attention, finding that it cannot model periodic finite-state languages, nor hierarchical structure, unless the number of layers or heads increases with input length.
Generating Long Sequences with Sparse Transformers
This paper introduces sparse factorizations of the attention matrix that reduce the quadratic time and memory cost of attention to $O(n \sqrt{n})$, generates unconditional samples that demonstrate global coherence and great diversity, and shows it is possible in principle to use self-attention to model sequences of length one million or more.
Are Transformers universal approximators of sequence-to-sequence functions?
It is established that Transformer models are universal approximators of continuous permutation equivariant sequence-to-sequence functions with compact support, which is quite surprising given the amount of shared parameters in these models.