Is Attention Interpretable?

Sofia Serrano and Noah A. Smith
Attention mechanisms have recently boosted performance on a range of NLP tasks. [...] Key result: We conclude that while attention noisily predicts input components’ overall importance to a model, it is by no means a fail-safe indicator.

Understanding Attention for Text Classification

This study examines the internal mechanism of attention by looking into the gradient update process. For each word token it analyzes two quantities: a polarity score and an attention score, where the latter is a global assessment of the token's significance.


The proposed methods mitigate the issue of combinatorial shortcuts in attention weights and can effectively improve the interpretability of attention mechanisms on a variety of datasets.

Why is Attention Not So Interpretable

This work theoretically analyzes combinatorial shortcuts, designs an intuitive experiment to demonstrate their existence, and proposes two methods to mitigate the issue. The results show that the proposed methods can effectively improve the interpretability of attention mechanisms on a variety of datasets.

Why is Attention Not So Attentive?

One root cause of this phenomenon is shown to be combinatorial shortcuts: models may obtain information not only from the parts highlighted by attention mechanisms but also from the attention weights themselves.

Improving the Faithfulness of Attention-based Explanations with Task-specific Information for Text Classification

A new family of Task-Scaling (TaSc) mechanisms is proposed that learn task-specific, non-contextualised information to scale the original attention weights. TaSc is shown to consistently provide more faithful attention-based explanations than three widely used interpretability techniques.

Is Sparse Attention more Interpretable?

It is observed in this setting that inducing sparsity may make it less plausible that attention can be used as a tool for understanding model behavior.

Attention Flows: Analyzing and Comparing Attention Mechanisms in Language Models

The visualization, Attention Flows, is designed to support users in querying, tracing, and comparing attention within layers, across layers, and amongst attention heads in Transformer-based language models, and to help users gain insight on how a classification decision is made.

A Song of (Dis)agreement: Evaluating the Evaluation of Explainable Artificial Intelligence in Natural Language Processing

It is argued that the community should stop using rank correlation as an evaluation metric for attention-based explanations and instead test various explanation methods and employ a human-in-the-loop process to determine if the explanations align with human intuition for the particular use case at hand.
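
For context, the rank-correlation metric discussed above is often computed between attention weights and another importance score for the same tokens. The sketch below assumes Kendall's tau (tau-a, with no tie correction) as the correlation measure. The function name `kendall_tau` and the example scores are illustrative and not taken from the paper.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Kendall's tau-a between two score lists over the same tokens.

    Counts token pairs that both lists order the same way (concordant)
    versus opposite ways (discordant). Tied pairs count as neither.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    concordant = 0
    discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Made-up scores for a three-token input (assumption, for illustration only).
attention = [0.5, 0.3, 0.2]
other_importance = [0.6, 0.1, 0.3]
tau = kendall_tau(attention, other_importance)
```

A value near 1 means the two methods rank tokens similarly, and a value near -1 means they rank them in opposite order. A single number of this kind says nothing about whether either ranking matches human intuition, which is the concern the summary raises.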

ML Reproducibility Challenge 2020 Learning to Deceive With Attention-Based Explanations

This paper challenges the use of attention-based explanations through a series of experiments using classification and sequence-to-sequence (seq2seq) models, and examines the model’s use of impermissible tokens, which are user-defined tokens that can introduce bias, e.g., gendered pronouns.

On Exploring Attention-based Explanation for Transformer Models in Text Classification

AGrad and RePAGrad significantly outperform existing state-of-the-art explanation methods in faithfulness and consistency, at the cost of nominal degradation on resilience compared to attention weights, and reveal that elements of a model architecture can play an important role towards explainability.



Attention is not Explanation

This work performs extensive experiments across a variety of NLP tasks to assess the degree to which attention weights provide meaningful “explanations” for predictions, and finds that they largely do not.

Interpreting Recurrent and Attention-Based Neural Models: a Case Study on Natural Language Inference

This paper proposes to interpret the intermediate layers of NLI models by visualizing the saliency of attention and LSTM gating signals and presents several examples for which their methods are able to reveal interesting insights and identify the critical information contributing to the model decisions.

Effective Approaches to Attention-based Neural Machine Translation

A global approach which always attends to all source words and a local one that only looks at a subset of source words at a time are examined, demonstrating the effectiveness of both approaches on the WMT translation tasks between English and German in both directions.
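
The difference between the two approaches can be sketched as follows. This is a simplified illustration, assuming dot-product scoring and a fixed window around a given center position. The function names, the `half_width` parameter, and the toy vectors are hypothetical, and the paper's own local variant may differ in how the window is chosen and weighted.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def global_attention(query: Sequence[float],
                     sources: Sequence[Sequence[float]]) -> List[float]:
    """Attend to every source position."""
    return softmax([dot(query, h) for h in sources])


def local_attention(query: Sequence[float],
                    sources: Sequence[Sequence[float]],
                    center: int, half_width: int) -> List[float]:
    """Attend only to positions in [center - half_width, center + half_width]."""
    lo = max(0, center - half_width)
    hi = min(len(sources), center + half_width + 1)
    window = softmax([dot(query, h) for h in sources[lo:hi]])
    # Positions outside the window receive zero weight.
    return [0.0] * lo + window + [0.0] * (len(sources) - hi)


# Toy source hidden states and decoder query (assumptions for illustration).
sources = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
query = [1.0, 0.0]
global_weights = global_attention(query, sources)
local_weights = local_attention(query, sources, center=1, half_width=1)
```

In both cases the weights sum to one. The local variant simply restricts which source positions can receive any weight.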

Visualizing and Understanding Neural Models in NLP

Four strategies for visualizing compositionality in neural models for NLP, inspired by similar work in computer vision, including LSTM-style gates that measure information flow and gradient back-propagation, are described.

Learning Structured Text Representations

A model is proposed that encodes a document while automatically inducing rich structural dependencies. It embeds a differentiable non-projective parsing algorithm into a neural model and uses attention mechanisms to incorporate the structural biases.

Rationalizing Neural Predictions

The approach combines two modular components, a generator and an encoder, which are trained to operate well together: the generator specifies a distribution over text fragments as candidate rationales, and these are passed through the encoder for prediction.

Pathologies of Neural Models Make Interpretations Difficult

This work uses input reduction, which iteratively removes the least important word from the input, to expose pathological behaviors of neural models: the remaining words appear nonsensical to humans and are not the ones determined as important by interpretation methods.
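
The input reduction loop described above can be sketched as follows. This is a minimal illustration under stated assumptions: the keyword-counting classifier (`prob_pos`, `predict_label`) is a hypothetical stand-in for a real neural model, and word importance is measured by leave-one-out confidence, which may differ from the importance measure used in the paper.

```python
from typing import List

# Hypothetical cue words for a toy sentiment classifier (not from the paper).
POSITIVE = {"great", "good"}
NEGATIVE = {"bad", "awful"}


def prob_pos(words: List[str]) -> float:
    """Toy probability of the positive class, Laplace-smoothed."""
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (pos + 1) / (pos + neg + 2)


def predict_label(words: List[str]) -> str:
    return "pos" if prob_pos(words) >= 0.5 else "neg"


def confidence(words: List[str], label: str) -> float:
    p = prob_pos(words)
    return p if label == "pos" else 1.0 - p


def input_reduction(words: List[str]) -> List[str]:
    """Repeatedly drop the least important word while the prediction holds."""
    label = predict_label(words)
    current = list(words)
    while len(current) > 1:
        # The least important word is the one whose removal leaves the
        # highest confidence in the original label.
        least = max(
            range(len(current)),
            key=lambda i: confidence(current[:i] + current[i + 1:], label),
        )
        reduced = current[:least] + current[least + 1:]
        if predict_label(reduced) != label:
            break  # removing any further word would change the prediction
        current = reduced
    return current


reduced_input = input_reduction(["the", "movie", "was", "great"])
```

With this toy classifier the reduced input is a sensible cue word. The pathological behavior reported in the paper concerns real neural models, where the words that remain can look nonsensical to humans.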

Comparing Automatic and Human Evaluation of Local Explanations for Text Classification

A variety of local explanation approaches using automatic measures based on word deletion are evaluated, showing that an evaluation using a crowdsourcing experiment correlates moderately with these automatic measures and that a variety of other factors also impact the human judgements.

Explaining Predictions of Non-Linear Classifiers in NLP

This paper applies layer-wise relevance propagation for the first time to natural language processing (NLP) and uses it to explain the predictions of a convolutional neural network trained on a topic categorization task.

Understanding Neural Networks through Representation Erasure

This paper proposes a general methodology to analyze and interpret decisions from a neural model by observing the effects on the model of erasing various parts of the representation, such as input word-vector dimensions, intermediate hidden units, or input words.
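
As a rough illustration of the erasure idea, the sketch below zeroes out one input dimension at a time and records how much a model's score drops. The linear scorer, its weights, and the function names are assumptions made for this example. The paper applies the idea to neural models and may define importance differently, for instance as a relative rather than absolute change.

```python
from typing import List, Sequence


def score(vec: Sequence[float], weights: Sequence[float]) -> float:
    """Toy linear model score (a stand-in for a neural model's output)."""
    return sum(v * w for v, w in zip(vec, weights))


def erasure_importance(vec: Sequence[float],
                       weights: Sequence[float]) -> List[float]:
    """Importance of each dimension = score drop when that dimension is erased."""
    base = score(vec, weights)
    importances = []
    for d in range(len(vec)):
        erased = list(vec)
        erased[d] = 0.0  # erase dimension d
        importances.append(base - score(erased, weights))
    return importances


# Hypothetical word vector and model weights.
importances = erasure_importance([1.0, 2.0, 3.0], [0.5, 0.0, -1.0])
```

A positive value means erasing the dimension lowers the score, a negative value means erasing it raises the score, and zero means the model ignores that dimension.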