How Much Does Attention Actually Attend? Questioning the Importance of Attention in Pretrained Transformers

Michael Hassid, Hao Peng, Daniel Rotem, Jungo Kasai, Ivan Montero, Noah Smith, Roy Schwartz
Conference on Empirical Methods in Natural Language Processing
The attention mechanism is considered the backbone of the widely-used Transformer architecture. It contextualizes the input by computing input-specific attention matrices. We find that this mechanism, while powerful and elegant, is not as important as typically thought for pretrained language models. We introduce PAPA, a new probing method that replaces the input-dependent attention matrices with constant ones -- the average attention weights over multiple inputs. We use PAPA to analyze several… 
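
The core idea of the probing method in the abstract, replacing input-dependent attention matrices with a constant matrix averaged over several inputs, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the single attention head, the random projection matrices `w_q`, `w_k`, `w_v`, the fixed sequence length, and the toy sizes are all assumptions made only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(x, w_q, w_k):
    """Input-dependent attention matrix for one sequence x of shape (seq_len, d_model)."""
    q, k = x @ w_q, x @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))

rng = np.random.default_rng(0)
seq_len, d_model, n_inputs = 6, 8, 32  # toy sizes (assumed, illustration only)
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))

# Average the attention matrices over a set of inputs to obtain one constant matrix.
inputs = rng.normal(size=(n_inputs, seq_len, d_model))
constant_attn = np.mean([attention_weights(x, w_q, w_k) for x in inputs], axis=0)

# For a new input, mix the value vectors with the constant matrix
# instead of the matrix computed from that input.
new_x = rng.normal(size=(seq_len, d_model))
values = new_x @ w_v
standard_out = attention_weights(new_x, w_q, w_k) @ values
constant_out = constant_attn @ values
```

Because each per-input attention matrix is row-stochastic, their average is too, so `constant_attn` still forms a valid set of mixing weights; only its dependence on the current input is removed.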


Quantifying Context Mixing in Transformers

By expanding the scope of analysis to the whole encoder block, this work proposes Value Zeroing, a novel context mixing score customized for Transformers that provides a deeper understanding of how information is mixed at each encoder layer.



Are Sixteen Heads Really Better than One?

The surprising observation is made that even if models have been trained using multiple heads, in practice a large percentage of attention heads can be removed at test time without significantly impacting performance.

Differentiable Subset Pruning of Transformer Heads

Differentiable subset pruning is introduced, a new head pruning technique that learns per-head importance variables and then enforces a user-specified hard constraint on the number of unpruned heads via stochastic gradient descent.

Random Feature Attention

RFA, a linear time and space attention that uses random feature methods to approximate the softmax function, is proposed and explored, showing that RFA is competitive in terms of both accuracy and efficiency on three long text classification datasets.
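
As a rough illustration of how random features can approximate softmax attention in linear time, the sketch below uses standard random Fourier features for the Gaussian kernel. It is a generic construction under stated assumptions, not the RFA authors' code: the function names, the unscaled dot-product scores, the feature count, and the small input norms are all choices made for this example.

```python
import numpy as np

def random_features(x, w):
    """Random Fourier features: phi(x) . phi(y) approximates exp(-||x - y||^2 / 2)."""
    proj = x @ w.T  # shape (n, D)
    return np.concatenate([np.sin(proj), np.cos(proj)], axis=-1) / np.sqrt(w.shape[0])

def softmax_attention(q, k, v):
    # Exact attention with unscaled dot-product scores (quadratic in sequence length).
    scores = q @ k.T
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return (weights / weights.sum(axis=-1, keepdims=True)) @ v

def random_feature_attention(q, k, v, w):
    # exp(q.k) = exp(|q|^2 / 2) * exp(|k|^2 / 2) * exp(-|q - k|^2 / 2).
    # The query factor cancels in the normalisation, so only the key factor is kept.
    phi_q = random_features(q, w)
    phi_k = random_features(k, w) * np.exp(0.5 * (k ** 2).sum(axis=-1, keepdims=True))
    kv = phi_k.T @ v       # (2D, d_v): keys and values are summarised once
    z = phi_k.sum(axis=0)  # (2D,): normaliser summary
    return (phi_q @ kv) / (phi_q @ z)[:, None]

rng = np.random.default_rng(0)
n, d, d_v, num_features = 5, 4, 3, 2048  # toy sizes (assumed)
q = 0.3 * rng.normal(size=(n, d))
k = 0.3 * rng.normal(size=(n, d))
v = rng.normal(size=(n, d_v))
w = rng.normal(size=(num_features, d))   # w_i ~ N(0, I)

exact = softmax_attention(q, k, v)
approx = random_feature_attention(q, k, v, w)
```

The keys and values are folded into `kv` and `z` once, so each query costs time proportional to the number of features rather than to the sequence length. The approximation quality depends on the number of random features and on the norms of the queries and keys.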

Paying More Attention to Self-attention: Improving Pre-trained Language Models via Attention Guiding

This work proposes a simple yet effective attention guiding mechanism that improves the performance of PLMs by encouraging attention towards established goals. It introduces two attention guiding methods: attention map discrimination guiding (MDG) and attention pattern decorrelation guiding (PDG).

Rethinking Attention with Performers

Performers are introduced: Transformer architectures that can estimate regular (softmax) full-rank-attention Transformers with provable accuracy, using only linear space and time complexity and without relying on priors such as sparsity or low-rankness.

Attention is not Explanation

This work performs extensive experiments across a variety of NLP tasks to assess the degree to which attention weights provide meaningful “explanations” for predictions, and finds that they largely do not.

Hard-Coded Gaussian Attention for Neural Machine Translation

A “hard-coded” attention variant without any learned parameters is developed. It offers insight into which components of the Transformer are actually important, which the authors hope will guide future work on simpler and more efficient attention-based models.

Pay Attention to MLPs

This work proposes a simple attention-free network architecture, gMLP, based solely on MLPs with gating, and shows that it can perform as well as Transformers in key language and vision applications and can scale as much as Transformers over increased data and compute.

Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

The most important and confident heads are found to play consistent and often linguistically interpretable roles. When heads are pruned using a method based on stochastic gates and a differentiable relaxation of the L0 penalty, the specialized heads are the last to be pruned.

DeBERTa: Decoding-enhanced BERT with Disentangled Attention

A new model architecture, DeBERTa (Decoding-enhanced BERT with disentangled attention), is proposed that improves on the BERT and RoBERTa models using two novel techniques, which significantly improve the efficiency of model pre-training and performance on downstream tasks.