Corpus ID: 239024804

Inductive Biases and Variable Creation in Self-Attention Mechanisms

@article{Edelman2021InductiveBA,
  title={Inductive Biases and Variable Creation in Self-Attention Mechanisms},
  author={Benjamin L. Edelman and Surbhi Goel and Sham M. Kakade and Cyril Zhang},
  journal={ArXiv},
  year={2021},
  volume={abs/2110.10090}
}
Self-attention, an architectural motif designed to model long-range interactions in sequential data, has driven numerous recent breakthroughs in natural language processing and beyond. This work provides a theoretical analysis of the inductive biases of self-attention modules, where our focus is to rigorously establish which functions and long-range dependencies self-attention blocks prefer to represent. Our main result shows that bounded-norm Transformer layers create sparse variables: they can represent sparse functions of the input sequence, with sample complexity that scales only logarithmically with the context length.
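For concreteness, below is a minimal NumPy sketch of the object under analysis, a single self-attention head. This is illustrative code rather than the authors' implementation; the weight names (W_q, W_k, W_v) and dimensions are assumptions chosen for the example, and the norm bounds the paper places on these projection matrices are not enforced here.

import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_head(X, W_q, W_k, W_v):
    # X: (T, d) sequence of T token embeddings.
    # W_q, W_k, W_v: (d, d_h) projection matrices; the paper's capacity
    # bounds are stated in terms of norms of matrices like these.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (T, T) pairwise attention scores
    A = softmax(scores, axis=-1)              # each row is a distribution over positions
    return A @ V                              # each output mixes the values at attended positions

# Tiny usage example with random weights.
rng = np.random.default_rng(0)
T, d, d_h = 8, 16, 16
X = rng.normal(size=(T, d))
W_q, W_k, W_v = (0.1 * rng.normal(size=(d, d_h)) for _ in range(3))
out = self_attention_head(X, W_q, W_k, W_v)
print(out.shape)  # (8, 16)

When a row of A concentrates nearly all of its mass on a handful of positions, the corresponding output depends on only those positions; this is the intuition behind the "sparse variables" the abstract refers to.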

References

SHOWING 1-10 OF 49 REFERENCES
Infinite attention: NNGP and NTK for deep attention networks
TLDR: A rigorous extension of results to NNs involving attention layers is provided, showing that unlike single-head attention, which induces non-Gaussian behaviour, multi-head attention architectures behave as GPs as the number of heads tends to infinity.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR: A new language representation model, BERT, is introduced, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers; it can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
Attention is All you Need
TLDR: A new simple network architecture, the Transformer, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely, is proposed; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
What Does BERT Look at? An Analysis of BERT’s Attention
TLDR: It is shown that certain attention heads correspond well to linguistic notions of syntax and coreference, and an attention-based probing classifier is proposed and used to demonstrate that substantial syntactic information is captured in BERT’s attention.
Show, Attend and Tell: Neural Image Caption Generation with Visual Attention
TLDR: An attention-based model that automatically learns to describe the content of images is introduced; it can be trained in a deterministic manner using standard backpropagation techniques and stochastically by maximizing a variational lower bound.
Fantastic Generalization Measures and Where to Find Them
TLDR: This work presents the first large-scale study of generalization in deep networks, investigating more than 40 complexity measures taken from both theoretical bounds and empirical studies, and showing surprising failures of some measures as well as promising measures for further research.
Approximating How Single Head Attention Learns
TLDR: This work approximates model training as a two-stage process: early in training, when the attention weights are uniform, the model learns to translate an individual input word i to an output word o if they co-occur frequently; later, the model learns to attend to i while the correct output is o because it knows i translates to o.
On Generalization Bounds of a Family of Recurrent Neural Networks
TLDR: This work studies the generalization properties of vanilla RNNs as well as their variants, including Minimal Gated Unit (MGU), Long Short-Term Memory (LSTM), and convolutional (Conv) RNNs, and establishes refined generalization bounds with additional norm assumptions.
BERT Rediscovers the Classical NLP Pipeline
TLDR: This work finds that BERT represents the steps of the traditional NLP pipeline in an interpretable and localizable way, and that the regions responsible for each step appear in the expected sequence: POS tagging, parsing, NER, semantic roles, then coreference.
Language Models are Unsupervised Multitask Learners
TLDR: It is demonstrated that language models begin to learn a range of natural language processing tasks without any explicit supervision when trained on WebText, a new dataset of millions of webpages, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.