Corpus ID: 244117341

"Will You Find These Shortcuts?" A Protocol for Evaluating the Faithfulness of Input Salience Methods for Text Classification

@article{Bastings2021WillYF,
  title={"Will You Find These Shortcuts?" A Protocol for Evaluating the Faithfulness of Input Salience Methods for Text Classification},
  author={Jasmijn Bastings and Sebastian Ebert and Polina Zablotskaia and Anders Sandholm and Katja Filippova},
  journal={ArXiv},
  year={2021},
  volume={abs/2111.07367}
}
Feature attribution (a.k.a. input salience) methods, which assign an importance score to a feature, are abundant but may produce surprisingly different results for the same model on the same input. While differences are expected if disparate definitions of importance are assumed, most methods claim to provide faithful attributions and point at the features most relevant for a model’s prediction. Existing work on faithfulness evaluation is not conclusive and does not provide a clear answer as to how… 
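To make the abstract's object of study concrete, below is a minimal sketch of one common input salience method, Gradient x Input, applied to a toy text classifier. The vocabulary, embedding size, and mean-pooling classifier here are illustrative assumptions, not the models or salience methods evaluated in the paper.

```python
# A minimal Gradient x Input salience sketch on a toy bag-of-embeddings
# classifier (illustrative setup, not the paper's experimental configuration).
import torch
import torch.nn as nn

torch.manual_seed(0)

vocab = {"the": 0, "movie": 1, "was": 2, "great": 3, "awful": 4}
embed = nn.Embedding(len(vocab), 8)   # toy, randomly initialized embeddings
classifier = nn.Linear(8, 2)          # toy 2-class head

def logits_from(token_embeddings):
    # Mean-pool the token embeddings, then classify the pooled vector.
    return classifier(token_embeddings.mean(dim=0, keepdim=True))

tokens = ["the", "movie", "was", "great"]
ids = torch.tensor([vocab[t] for t in tokens])

# Detach from the embedding table and track gradients on the input vectors,
# so the attribution is with respect to this specific input.
token_embeddings = embed(ids).detach().requires_grad_(True)
logits = logits_from(token_embeddings)
pred = logits.argmax().item()

# Backpropagate the predicted-class logit down to the token embeddings.
logits[0, pred].backward()

# Gradient x Input, summed over the embedding dimension: one score per token.
salience = (token_embeddings.grad * token_embeddings).sum(dim=-1)
for tok, score in zip(tokens, salience.tolist()):
    print(f"{tok:>6s}: {score:+.4f}")
```

Running different methods of this kind (gradients, attention, perturbation-based explainers, and so on) on the same model and input can yield noticeably different token rankings, which is the discrepancy the paper's evaluation protocol targets.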

Citations

Evaluating the Faithfulness of Importance Measures in NLP by Recursively Masking Allegedly Important Tokens and Retraining
TLDR
This work adapts and improves a recently proposed faithfulness benchmark from computer vision called ROAR, by Hooker et al. (2019), and proposes a scalar faithfulness metric that makes it easy to compare results across papers (a minimal remove-and-retrain sketch appears after this list).
The Solvability of Interpretability Evaluation Metrics
TLDR
This paper presents a series of investigations showing that a beam search explainer is generally comparable or favorable to current choices such as LIME and SHAP, suggests rethinking the goals of model interpretability, and identifies several directions towards better evaluations of new method proposals.
When less is more: Simplifying inputs aids neural network understanding
TLDR
This work measures simplicity with the encoding bit size given by a pretrained generative model, minimizes the bit size to simplify inputs in training and inference, and investigates the effect of such simplification in several scenarios: conventional training, dataset condensation, and post-hoc explanations.
Learning to Scaffold: Optimizing Model Explanations for Teaching
TLDR
This work trains models on three natural language processing and computer vision tasks, and finds that students trained with explanations extracted with this framework are able to simulate the teacher more effectively than students trained with explanations produced by previous methods.
Diagnosing AI Explanation Methods with Folk Concepts of Behavior
When explaining AI behavior to humans, how is the communicated information being comprehended by the human explainee, and does it match what the explanation attempted to communicate? When can we say…
Measuring the Mixing of Contextual Information in the Transformer
TLDR
This paper considers the whole attention block, defines a metric to measure token-to-token interactions within each layer, considering the characteristics of the representation space, and aggregates layer-wise interpretations to provide input attribution scores for model predictions.
Post-hoc Interpretability for Neural NLP: A Survey
TLDR
This survey provides a categorization of how recent post-hoc interpretability methods communicate explanations to humans, discusses each method in depth, and covers how the methods are validated, as validation is a common concern.
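
As referenced in the ROAR entry above, the remove-and-retrain idea can be illustrated with a small, self-contained sketch: rank tokens by an importance score, delete the allegedly most important ones from the data, retrain, and compare accuracy. Everything below (the toy dataset, the linear-model-weight importance proxy, and scoring on the training set) is a simplifying assumption for illustration, not the actual ROAR benchmark or the adaptation proposed in that paper.

```python
# A toy ROAR-style (remove-and-retrain) faithfulness check. Dataset, importance
# proxy, and evaluation on the training set are illustrative simplifications.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

texts = [
    "great movie", "truly great plot", "great acting overall",
    "awful movie", "truly awful plot", "awful acting overall",
]
labels = np.array([1, 1, 1, 0, 0, 0])

def train_and_score(docs):
    # Fit a bag-of-words logistic regression and report training accuracy
    # (a real ROAR setup would mask train and test and score held-out data).
    vec = CountVectorizer()
    X = vec.fit_transform(docs)
    clf = LogisticRegression().fit(X, labels)
    return vec, clf, clf.score(X, labels)

# 1) Train on the original data.
vec, clf, base_acc = train_and_score(texts)

# 2) Rank tokens by an importance proxy (absolute linear-model weights).
weights = dict(zip(vec.get_feature_names_out(), np.abs(clf.coef_[0])))
top_k = {tok for tok, _ in sorted(weights.items(), key=lambda kv: -kv[1])[:2]}

# 3) Remove the allegedly important tokens and retrain from scratch
#    ("empty" is a fallback token in case a document is fully emptied).
masked = [" ".join(w for w in doc.split() if w not in top_k) or "empty"
          for doc in texts]
_, _, masked_acc = train_and_score(masked)

print(f"accuracy before masking: {base_acc:.2f}, after: {masked_acc:.2f}")
```

In a real ROAR-style evaluation, the drop caused by masking tokens ranked by a salience method is compared against the drop from masking the same number of randomly chosen tokens; a faithful ranking should hurt the retrained model more than the random baseline.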

References

SHOWING 1-10 OF 59 REFERENCES
Evaluating the Faithfulness of Importance Measures in NLP by Recursively Masking Allegedly Important Tokens and Retraining
TLDR
This work adapts and improves a recently proposed faithfulness benchmark from computer vision called ROAR, by Hooker et al. (2019), and proposes a scalar faithfulness metric that makes it easy to compare results across papers.
Do Feature Attribution Methods Correctly Attribute Features?
TLDR
This work evaluates three feature attribution methods (saliency maps, rationales, and attention), identifies their deficiencies, and adds a new perspective to the growing body of evidence questioning their correctness and reliability in the wild.
Data Staining: A Method for Comparing Faithfulness of Explainers
TLDR
A new evaluation method, Data Staining, is proposed that trains a stained predictor and evaluates an explainer’s ability to recover the stain; the greedy explainer consistently outperformed other, more complex explainers on black-box models for the authors' selected class of stains.
A Diagnostic Study of Explainability Techniques for Text Classification
TLDR
A comprehensive list of diagnostic properties for evaluating existing explainability techniques is developed; gradient-based explanations are found to perform best across tasks and model architectures, and further insights into the properties are presented.
Towards Faithfully Interpretable NLP Systems: How Should We Define and Evaluate Faithfulness?
TLDR
The current binary definition of faithfulness sets a potentially unrealistic bar for being considered faithful; the authors call for discarding the binary notion of faithfulness in favor of a more graded one, which is of greater practical utility.
Evaluating Saliency Methods for Neural Language Models
TLDR
Through the evaluation, various ways in which saliency methods can yield interpretations of low quality are identified, and it is recommended that future work deploying such methods on neural language models carefully validate the interpretations before drawing insights from them.
The elephant in the interpretability room: Why use attention as explanation when we have saliency methods?
TLDR
It is argued that input saliency methods are better suited, and that there are no compelling reasons to use attention, despite the coincidence that it provides a weight for each input.
Combining Feature and Instance Attribution to Detect Artifacts
TLDR
This paper proposes new hybrid approaches that combine saliency maps (which highlight important input features) with instance attribution methods (which retrieve training samples influential to a given prediction) and shows that this proposed training-feature attribution can be used to efficiently uncover artifacts in training data when a challenging validation set is available.
Benchmarking Attribution Methods with Relative Feature Importance
TLDR
This work proposes a framework for Benchmarking Attribution Methods (BAM) with a priori knowledge of relative feature importance and suggests that certain methods are more likely to produce false positive explanations: features that are incorrectly attributed as more important to model prediction.
"Why Should I Trust You?": Explaining the Predictions of Any Classifier
TLDR
LIME is proposed, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner by learning an interpretable model locally around the prediction.
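
For readers unfamiliar with LIME, the local-surrogate idea from the entry above can be sketched in a few lines: perturb the input, query the black-box model, and fit a weighted linear model on token-presence features. The stand-in black-box scorer, the masking-based perturbation, and the similarity weighting below are simplified assumptions; the lime package implements the actual method.

```python
# A toy LIME-style local surrogate for a text classifier. The "black box" is a
# stand-in keyword scorer; perturbation and weighting are simplified.
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)

def black_box_prob(text):
    # Stand-in model: probability of positive sentiment from keyword counts.
    words = text.split()
    score = words.count("great") - words.count("awful")
    return 1.0 / (1.0 + np.exp(-score))

tokens = "the movie was great but the ending was awful".split()

# 1) Sample neighbors of the input by randomly dropping tokens.
masks = rng.integers(0, 2, size=(500, len(tokens)))
masks[0] = 1  # keep the original instance in the sample
neighbors = [" ".join(t for t, keep in zip(tokens, m) if keep) for m in masks]
preds = np.array([black_box_prob(t) for t in neighbors])

# 2) Weight neighbors by similarity to the original (fraction of kept tokens).
similarity = masks.mean(axis=1)

# 3) Fit a weighted linear surrogate on the binary token-presence features.
surrogate = Ridge(alpha=1.0).fit(masks, preds, sample_weight=similarity)

# The surrogate's coefficients serve as the per-token explanation.
for tok, coef in zip(tokens, surrogate.coef_):
    print(f"{tok:>7s}: {coef:+.3f}")
```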