Self-Attention Networks Can Process Bounded Hierarchical Languages

@inproceedings{Yao2021SelfAttentionNC,
  title={Self-Attention Networks Can Process Bounded Hierarchical Languages},
  author={Shunyu Yao and Binghui Peng and Christos H. Papadimitriou and Karthik Narasimhan},
  booktitle={ACL},
  year={2021}
}
Despite their impressive performance in NLP, self-attention networks were recently proved to be limited for processing formal languages with hierarchical structure, such as Dyck-k, the language consisting of well-nested parentheses of k types. This suggested that natural language can be approximated well with models that are too weak for formal languages, or that the role of hierarchy and recursion in natural language might be limited. We qualify this implication by proving that self-attention… 
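To make the object of study concrete: the bounded-depth variant of Dyck-k referred to in the title is just well-nested brackets of k types whose nesting depth never exceeds some bound D. The stack-based membership check below illustrates the language only, not the paper's self-attention construction; the function name and token encoding are chosen for this sketch.

```python
def is_dyck_k_bounded(tokens, k, max_depth):
    """Membership check for bounded-depth Dyck-k: well-nested brackets of k
    types whose nesting depth never exceeds max_depth.

    Token encoding (chosen only for this sketch): bracket type i opens as
    "(i" and closes as ")i", e.g. "(0", ")0", "(1", ")1".
    """
    stack = []
    for tok in tokens:
        kind, typ = tok[0], tok[1:]
        if not typ.isdigit() or int(typ) >= k:
            return False                    # unknown bracket type
        if kind == "(":
            stack.append(typ)
            if len(stack) > max_depth:      # depth bound exceeded
                return False
        elif kind == ")":
            if not stack or stack.pop() != typ:
                return False                # mismatched type or unbalanced close
        else:
            return False                    # not a bracket symbol
    return not stack                        # every opened bracket must be closed

# "(0 (1 )1 )0" uses two bracket types and reaches depth 2:
print(is_dyck_k_bounded("(0 (1 )1 )0".split(), k=2, max_depth=2))  # True
print(is_dyck_k_bounded("(0 (1 )1 )0".split(), k=2, max_depth=1))  # False
```

The paper's contribution is to show that this kind of bounded-depth bookkeeping, trivial for an explicit stack, is also within reach of self-attention networks; the sketch only pins down the target language.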

Citations

Implicit n-grams Induced by Recurrence

TLDR
This work presents a study showing that there exist explainable components residing within the hidden states, reminiscent of classical n-gram features, which could add interpretability to RNN architectures and also provide inspiration for proposing new architectures for sequential data.

Formal Language Recognition by Hard Attention Transformers: Perspectives from Circuit Complexity

TLDR
It is shown that UHAT and GUHAT transformers, viewed as string acceptors, can only recognize formal languages in the complexity class AC0, the class of languages recognizable by families of Boolean circuits of constant depth and polynomial size, while the non-AC0 languages MAJORITY and DYCK-1 are recognizable by AHAT networks, implying that AHAT can recognize languages that UHAT and GUHAT cannot.

On the Power of Saturated Transformers: A View from Circuit Complexity

TLDR
This work analyzes the circuit complexity of transformers with saturated attention: a generalization of hard attention that more closely captures the attention patterns learnable in practical transformers and shows that saturated transformers transcend the limitations of hard-attention transformers.

Thinking Like Transformers

TLDR
This paper proposes a computational model for the transformer encoder in the form of a programming language, the Restricted Access Sequence Processing Language (RASP), and provides RASP programs for histograms, sorting, and Dyck languages.

Saturated Transformers are Constant-Depth Threshold Circuits

TLDR
Saturated transformers transcend the known limitations of hard-attention transformers, and it is proved saturated transformers with floating-point values can be simulated by constant-depth threshold circuits, giving the class TC0 as an upper bound on the formal languages they recognize.

What Can Transformers Learn In-Context? A Case Study of Simple Function Classes

TLDR
It is shown empirically that standard Transformers can be trained from scratch to perform in-context learning of linear functions; that is, the trained model is able to learn unseen linear functions from in-context examples with performance comparable to the optimal least squares estimator.
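As a concrete picture of that setup: each prompt consists of in-context pairs (x_i, w·x_i) for a hidden weight vector w, followed by a query x, and the transformer's prediction is compared against the least-squares fit to the same prompt. The snippet below sketches only the least-squares baseline side of that comparison (NumPy only; names are mine, not the paper's code).

```python
import numpy as np

rng = np.random.default_rng(0)

def least_squares_baseline(xs, ys, x_query):
    """Baseline for in-context linear regression: fit w_hat by least squares
    on the in-context (x_i, y_i) pairs, then predict w_hat @ x_query."""
    w_hat, *_ = np.linalg.lstsq(xs, ys, rcond=None)
    return x_query @ w_hat

# One in-context "prompt": n_ctx examples of a random linear function w.
d, n_ctx = 5, 20
w = rng.normal(size=d)
xs = rng.normal(size=(n_ctx, d))
ys = xs @ w                      # noiseless labels y_i = w . x_i
x_query = rng.normal(size=d)

pred = least_squares_baseline(xs, ys, x_query)
print(pred, x_query @ w)         # with n_ctx >= d and no noise, these match
```

With at least d noiseless, generic examples the least-squares fit recovers w exactly, which is why it serves as the optimal reference point for the trained model.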

References

Showing 1-10 of 52 references

On the Ability of Self-Attention Networks to Recognize Counter Languages

TLDR
This work systematically studies the ability of Transformers to model such languages, the role of their individual components in doing so, and the influence of positional encoding schemes on the learning and generalization ability of the model.

RNNs Can Generate Bounded Hierarchical Languages with Optimal Memory

TLDR
Dyck-(k, m) is introduced, the language of well-nested brackets of k types with nesting depth at most m, reflecting the bounded memory needs and long-distance dependencies of natural language syntax, and it is proved that an RNN with $O(m \log k)$ hidden units suffices, an exponential reduction in memory, by an explicit construction.
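To unpack the $O(m \log k)$ figure quoted above: the entire bracket stack of a Dyck-(k, m) prefix fits in roughly m·⌈log₂ k⌉ bits plus a depth pointer, whereas a one-hot code over all reachable stack contents grows exponentially in m. A small arithmetic sketch (function names and example constants are illustrative, not from the paper):

```python
import math

def dyck_stack_memory_bits(k, m):
    """Bits needed to store the full bracket stack of a Dyck-(k, m) prefix:
    at most m slots, each naming one of k bracket types, plus the depth."""
    return m * math.ceil(math.log2(k)) + math.ceil(math.log2(m + 1))

def num_stack_configurations(k, m):
    """Number of distinct stack contents of depth <= m; a one-hot encoding
    over configurations would need this many units, exponential in m."""
    return sum(k ** d for d in range(m + 1))

# k = 8 bracket types, depth bound m = 10:
print(dyck_stack_memory_bits(8, 10))    # 34 bits -- the O(m log k) regime
print(num_stack_configurations(8, 10))  # 1227133513 distinct stack contents
```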

How Can Self-Attention Networks Recognize Dyck-n Languages?

TLDR
The results show that SA+ is able to generalize to longer sequences and deeper dependencies, and finds attention maps learned by SA+ to be amenable to interpretation and compatible with a stack-based language recognizer.

Theoretical Limitations of Self-Attention in Neural Sequence Models

TLDR
Across both soft and hard attention, strong theoretical limitations are shown of the computational abilities of self-attention, finding that it cannot model periodic finite-state languages, nor hierarchical structure, unless the number of layers or heads increases with input length.

Assessing the Ability of Self-Attention Networks to Learn Word Order

TLDR
Experimental results reveal that: 1) SAN trained on word reordering detection indeed has difficulty learning positional information even with the position embedding; and 2) SAN trained on machine translation learns better positional information than its RNN counterpart, in which the position embedding plays a critical role.

Can Recurrent Neural Networks Learn Nested Recursion?

TLDR
This paper investigates experimentally the capability of several recurrent neural networks (RNNs) to learn nested recursion, and measures an upper bound of their capability to do so by simplifying the task to learning a generalized Dyck language, namely one composed of matching parentheses of various kinds.

A Formal Hierarchy of RNN Architectures

TLDR
It is hypothesized that the practical learnable capacity of unsaturated RNNs obeys a similar hierarchy, and empirical results to support this conjecture are provided.

Evaluating the Ability of LSTMs to Learn Context-Free Grammars

TLDR
It is concluded that LSTMs do not learn the relevant underlying context-free rules, suggesting that their good overall performance is instead attained by an efficient way of evaluating nuisance variables.

A Recurrent Network that performs a Context-Sensitive Prediction Task

TLDR
This work shows that, at least from a representational point of view, connectionist architectures can handle more complex formal languages than was previously known.

Learning the Dyck Language with Attention-based Seq2Seq Models

TLDR
It is revealed that attention mechanisms still cannot truly generalize over the recursion depth, although they perform much better than other models on the closing bracket tagging task, which suggests that this commonly used task is not sufficient to test a model’s understanding of CFGs.
...