How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers

Michael Hassid, Hao Peng, Daniel Rotem, Jungo Kasai, Ivan Montero, Noah Smith, Roy Schwartz
The attention mechanism is considered the backbone of the widely-used Transformer architecture. It contextualizes the input by computing input-specific attention matrices. We find that this mechanism, while powerful and elegant, is not as important as typically thought for pretrained language models. We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones—the average attention weights over multiple inputs. We use PAPA to analyze several…
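The probing idea — averaging attention matrices over many inputs and then applying the resulting constant matrix to every input — can be sketched in a few lines of numpy. This is an illustration under simplifying assumptions (a single head, equal-length inputs, hypothetical `qk_fn`/`values_fn` helpers), not the paper's implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_matrix(q, k):
    # Standard input-dependent attention weights for one head.
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d))

def papa_probe(inputs, values_fn, qk_fn):
    # Average the input-dependent attention matrices over many inputs once...
    avg = np.mean([attention_matrix(*qk_fn(x)) for x in inputs], axis=0)
    # ...then apply the same constant matrix to every input's values.
    return [avg @ values_fn(x) for x in inputs]
```

Because each per-input attention matrix is row-stochastic, the averaged matrix is too, so it behaves like a fixed attention pattern shared across inputs.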


Quantifying Context Mixing in Transformers

By expanding the scope of analysis to the whole encoder block, this work proposes Value Zeroing, a novel context mixing score customized for Transformers that provides a deeper understanding of how information is mixed at each encoder layer.
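The core of the score — zero out one token's value vector, recompute, and measure how much each output moves — can be sketched as follows. This is a simplified single-step illustration (the paper operates over the full encoder block); the function name is hypothetical:

```python
import numpy as np

def value_zeroing_scores(attn, v):
    # attn: (n, n) attention weights; v: (n, d) value vectors.
    # scores[i, j] = how much output i changes when token j's value is zeroed.
    out = attn @ v
    n = v.shape[0]
    scores = np.zeros((n, n))
    for j in range(n):
        v_z = v.copy()
        v_z[j] = 0.0             # erase token j's contribution
        out_z = attn @ v_z
        scores[:, j] = np.linalg.norm(out - out_z, axis=-1)
    return scores
```

With identity attention, zeroing token j only perturbs output j, so the score matrix is diagonal — a quick sanity check on the construction.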



Are Sixteen Heads Really Better than One?

The surprising observation is made that even if models have been trained using multiple heads, in practice a large percentage of attention heads can be removed at test time without significantly impacting performance.
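Removing a head at test time amounts to zeroing its output before the heads are concatenated and projected — a minimal numpy sketch, assuming precomputed per-head outputs (not the paper's code):

```python
import numpy as np

def concat_heads_with_mask(head_outputs, head_mask):
    # head_outputs: (num_heads, seq_len, head_dim); head_mask: (num_heads,)
    # with 1.0 for kept heads and 0.0 for pruned ones.
    masked = head_outputs * head_mask[:, None, None]
    # Concatenate along the feature dimension, as before the output projection.
    return masked.transpose(1, 0, 2).reshape(head_outputs.shape[1], -1)
```

Ablation studies of this kind simply sweep `head_mask` over heads (or subsets of heads) and measure the change in task performance.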

Differentiable Subset Pruning of Transformer Heads

Differentiable subset pruning is introduced, a new head pruning technique that learns per-head importance variables and then enforces a user-specified hard constraint on the number of unpruned heads via stochastic gradient descent.

Random Feature Attention

RFA, a linear-time and linear-space attention mechanism that uses random feature methods to approximate the softmax function, is proposed and explored; RFA is shown to be competitive in terms of both accuracy and efficiency on three long text classification datasets.
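The random-feature idea can be sketched with Gaussian-kernel random Fourier features: once queries and keys are mapped through a feature map, keys and values can be summarized once instead of per query, giving linear time in sequence length. This is an illustration of the mechanism, not the paper's construction (which additionally rescales to recover the softmax's exponential kernel):

```python
import numpy as np

def random_fourier_features(x, W):
    # E[phi(q) @ phi(k)] = exp(-||q - k||^2 / 2), the Gaussian kernel.
    proj = x @ W.T
    d = W.shape[0]
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1) / np.sqrt(d)

def rfa_attention(q, k, v, W):
    phi_q = random_fourier_features(q, W)      # (n, 2D)
    phi_k = random_fourier_features(k, W)      # (m, 2D)
    num = phi_q @ (phi_k.T @ v)                # keys/values summarized once
    den = phi_q @ phi_k.sum(axis=0)            # kernel-based normalizer
    return num / den[:, None]
```

The crucial point is the parenthesization in `num`: `phi_k.T @ v` is computed once, so no n-by-m attention matrix is ever materialized.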

Paying More Attention to Self-attention: Improving Pre-trained Language Models via Attention Guiding

This work proposes a simple yet effective attention guiding mechanism that improves the performance of PLMs by encouraging attention toward established goals, and introduces two attention guiding methods: attention map discrimination guiding (MDG) and attention pattern decorrelation guiding (PDG).

Hard-Coded Gaussian Attention for Neural Machine Translation

A “hard-coded” attention variant without any learned parameters is developed, which offers insight into which components of the Transformer are actually important; it is hoped this will guide future work toward simpler and more efficient attention-based models.
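The hard-coded variant replaces learned attention with weights that depend only on position: a Gaussian centered on each query position. A minimal numpy sketch of that idea (not the paper's NMT implementation; `sigma` is an assumed hyperparameter):

```python
import numpy as np

def hardcoded_gaussian_attention(v, sigma=1.0):
    # No queries, no keys, no learned parameters: the weight on position j
    # for query position i is a Gaussian in the offset (i - j).
    n = v.shape[0]
    pos = np.arange(n)
    logits = -((pos[:, None] - pos[None, :]) ** 2) / (2 * sigma ** 2)
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)   # row-stochastic, like softmax
    return w @ v
```

As `sigma` shrinks, the pattern collapses toward the identity (each position attends only to itself); larger `sigma` gives a smooth local average.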

Attention Can Reflect Syntactic Structure (If You Let It)

This study presents decoding experiments for multilingual BERT across 18 languages in order to test the generalizability of the claim that dependency syntax is reflected in attention patterns, and demonstrates that full trees can be decoded above baseline accuracy from single attention heads.

Rethinking Attention with Performers

Performers are introduced: Transformer architectures that can estimate regular (softmax) full-rank attention with provable accuracy, using only linear space and time complexity, without relying on any priors such as sparsity or low-rankness.

Pay Attention to MLPs

This work proposes a simple attention-free network architecture, gMLP, based solely on MLPs with gating, and shows that it can perform as well as Transformers in key language and vision applications and can scale as much as Transformers over increased data and compute.
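The token-mixing ingredient of gMLP is a spatial gating unit: the channels are split in half and one half gates the other after a projection across the sequence dimension, mixing tokens without any attention. A minimal numpy sketch (an illustration of the idea; the paper also applies normalization and initializes the spatial weights near zero):

```python
import numpy as np

def spatial_gating_unit(x, W_spatial, b_spatial):
    # x: (seq_len, 2d). Split channels; gate one half with a learned
    # projection across *positions* -- token mixing with no attention.
    u, g = np.split(x, 2, axis=-1)
    gate = W_spatial @ g + b_spatial   # (seq_len, seq_len) mixes positions
    return u * gate                    # elementwise gating, (seq_len, d)
```

Initializing `W_spatial` near zero with `b_spatial = 1` makes the unit start as an identity on `u`, which is the stability trick the gMLP paper uses.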

Attention is All you Need

A new simple network architecture, the Transformer, is proposed, based solely on attention mechanisms and dispensing with recurrence and convolutions entirely; it generalizes well to other tasks, as shown by applying it successfully to English constituency parsing with both large and limited training data.
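The mechanism at the heart of the Transformer is scaled dot-product attention, Attention(Q, K, V) = softmax(QKᵀ/√d_k)V. A single-head numpy sketch:

```python
import numpy as np

def scaled_dot_product_attention(q, k, v):
    # q: (n, d_k), k: (m, d_k), v: (m, d_v).
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)                # scale by sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)             # softmax over keys
    return w @ v
```

The √d_k scaling keeps the dot products from pushing the softmax into regions with vanishing gradients as the key dimension grows.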

Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

It is found that the most important and confident heads play consistent and often linguistically-interpretable roles, and that when heads are pruned using a method based on stochastic gates and a differentiable relaxation of the L0 penalty, the specialized heads are the last to be pruned.
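The stochastic gate used in this style of L0 pruning is typically a hard-concrete variable: a differentiable relaxation of a 0/1 head mask whose learned parameter controls how open the gate is. A minimal numpy sketch of sampling such a gate (an illustration of the relaxation, not the paper's training code):

```python
import numpy as np

def hard_concrete_gate(log_alpha, rng, beta=2/3, gamma=-0.1, zeta=1.1):
    # Sample a gate in [0, 1]: a sigmoid of noisy logits, stretched to
    # (gamma, zeta) and clipped, so exact 0s and 1s have nonzero mass.
    u = rng.uniform(1e-6, 1 - 1e-6, size=np.shape(log_alpha))
    s = 1 / (1 + np.exp(-(np.log(u) - np.log(1 - u) + log_alpha) / beta))
    return np.clip(s * (zeta - gamma) + gamma, 0.0, 1.0)
```

Heads whose `log_alpha` is driven strongly negative during training are effectively pruned (their gates collapse to exactly 0), while important heads keep gates pinned at 1.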