Simplicity Bias in Transformers and their Ability to Learn Sparse Boolean Functions

S. Bhattamishra, Arkil Patel, Varun Kanade, Phil Blunsom
Despite the widespread success of Transformers on NLP tasks, recent works have found that they struggle to model several formal languages when compared to recurrent models. This raises the question of why Transformers perform well in practice and whether they have any properties that enable them to generalize better than recurrent models. In this work, we conduct an extensive empirical study on Boolean functions to demonstrate the following: (i) Random Transformers are relatively more biased… 
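The notion of a sparse Boolean function studied here can be illustrated with a minimal sketch (the particular function and relevant positions below are illustrative choices, not taken from the paper): a k-sparse function's output depends on only k of the n input bits.

```python
from itertools import product

def sparse_parity(x, relevant=(0, 2)):
    """A sparse Boolean function: the output depends only on the bits at
    the (hypothetical) `relevant` positions -- here, the parity of bits
    0 and 2 -- no matter how long the full input is."""
    return sum(x[i] for i in relevant) % 2

# Only 2 of the 4 input bits influence the output; flipping either of
# the other two bits never changes the label.
inputs = list(product([0, 1], repeat=4))
labels = [sparse_parity(x) for x in inputs]
```

Learning such a function requires identifying the small relevant subset among many irrelevant coordinates, which is the setting the paper uses to compare Transformers and recurrent models.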



On the Ability and Limitations of Transformers to Recognize Formal Languages

This work systematically studies the ability of Transformers to model such languages, as well as the role of their individual components in doing so, and provides insights on the role of the self-attention mechanism in modeling certain behaviors and the influence of positional encoding schemes on learning and generalization abilities.

Saturated Transformers are Constant-Depth Threshold Circuits

Saturated transformers are shown to transcend the known limitations of hard-attention transformers, and it is proved that saturated transformers with floating-point values can be simulated by constant-depth threshold circuits, giving the class TC0 as an upper bound on the formal languages they recognize.

Deep learning generalizes because the parameter-function map is biased towards simple functions

This paper argues that the parameter-function map of many DNNs should be exponentially biased towards simple functions, and provides clear evidence for this strong simplicity bias in a model DNN for Boolean functions, as well as in much larger fully connected and convolutional networks applied to CIFAR10 and MNIST.
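The parameter-function map idea can be sketched in miniature (a toy single-neuron model, not the paper's DNN setup): sample parameters at random, record which Boolean function each setting computes, and observe that a few functions absorb most of the probability mass.

```python
import itertools
import random
import collections

random.seed(0)

def random_threshold_fn(n=3):
    """Sample one random parameter setting of a toy 1-neuron model
    (Gaussian weights and bias) and return the Boolean function it
    computes, identified by its full truth table on {0,1}^n."""
    w = [random.gauss(0, 1) for _ in range(n)]
    b = random.gauss(0, 1)
    inputs = itertools.product([0, 1], repeat=n)
    return tuple(int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0)
                 for x in inputs)

# Map many random parameter settings to functions and count repeats.
# A uniform map over all 2**8 truth tables would give ~20 hits each
# for 5000 samples; in practice a few simple functions (e.g. the
# constants) dominate -- the bias the paper argues for.
counts = collections.Counter(random_threshold_fn() for _ in range(5000))
```

This is only a sketch of the measurement, under the assumption that a linear threshold unit stands in for the model DNN; the paper's evidence comes from actual DNNs on Boolean functions, CIFAR10, and MNIST.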

Exploring Length Generalization in Large Language Models

It is shown that combining pretrained large language models’ in-context learning abilities with scratchpad prompting yields a dramatic improvement in length generalization, and an error analysis is conducted to identify common sources of mistakes, highlighting opportunities for equipping language models with the ability to generalize to longer problems.

Theoretical Limitations of Self-Attention in Neural Sequence Models

Across both soft and hard attention, strong theoretical limitations on the computational abilities of self-attention are shown: it cannot model periodic finite-state languages or hierarchical structure unless the number of layers or heads increases with input length.

Sensitivity as a Complexity Measure for Sequence Classification Tasks

A novel extension of the theory of Boolean function sensitivity is introduced, finding that sensitivity is higher on challenging tasks collected in GLUE than on simple text classification tasks, and that sensitivity predicts the performance both of simple lexical classifiers and of vanilla BiLSTMs without pretrained contextualized embeddings.
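As a rough illustration of the classical Boolean sensitivity notion this line of work extends (the standard definition, not the paper's novel extension): the sensitivity of f at an input is the number of single-bit flips that change f's output, and parity is maximally sensitive while majority is not.

```python
from itertools import product

def avg_sensitivity(f, n):
    """Average sensitivity of a Boolean function f on {0,1}^n: for each
    input, count the coordinates whose flip changes f's output, then
    average over all 2**n inputs."""
    total = 0
    for x in product([0, 1], repeat=n):
        for i in range(n):
            y = list(x)
            y[i] ^= 1  # flip coordinate i
            total += f(x) != f(tuple(y))
    return total / 2 ** n

parity = lambda x: sum(x) % 2               # every bit flip changes the output
majority = lambda x: int(sum(x) > len(x) / 2)

avg_sensitivity(parity, 3)    # 3.0: maximally sensitive
avg_sensitivity(majority, 3)  # 1.5: far less sensitive
```

Low-sensitivity functions are "simple" in this theory, which is what connects sensitivity to the difficulty of sequence classification tasks.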

Neural Networks and the Chomsky Hierarchy

It is demonstrated that grouping tasks according to the Chomsky hierarchy allows us to forecast whether certain architectures will be able to generalize to out-of-distribution inputs, including negative results where even extensive amounts of data and training time never lead to any non-trivial generalization.

On Evaluating the Generalization of LSTM Models in Formal Languages

This paper empirically evaluates the inductive learning capabilities of Long Short-Term Memory networks, a popular extension of simple RNNs, to learn simple formal languages.

Overcoming a Theoretical Limitation of Self-Attention

This work settles an open question by constructing transformers that recognize PARITY and FIRST with perfect accuracy, and uses layer normalization to bring the cross-entropy of both models arbitrarily close to zero.

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT is a new language representation model designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.