Corpus ID: 240070792

The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations

@inproceedings{Hase2021TheOP,
  title={The Out-of-Distribution Problem in Explainability and Search Methods for Feature Importance Explanations},
  author={Peter Hase and Harry Xie and Mohit Bansal},
  booktitle={NeurIPS},
  year={2021}
}
Feature importance (FI) estimates are a popular form of explanation, and they are commonly created and evaluated by computing the change in model confidence caused by removing certain input features at test time. For example, in the standard Sufficiency metric, only the top-k most important tokens are kept. In this paper, we study several under-explored dimensions of FI explanations, providing conceptual and empirical improvements for this form of explanation. First, we advance a new argument… 
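As a minimal sketch of the removal-based evaluation described in the abstract, the function below keeps only the top-k tokens ranked by a feature-importance estimate, masks the rest, and records the resulting drop in model confidence (a Sufficiency-style score). The predict_proba callable, the mask token, and the function name are illustrative assumptions rather than the paper's implementation; note that the paper's central concern is precisely that such masked inputs can be out-of-distribution for the model.

from typing import Callable, List, Sequence

def sufficiency_score(
    tokens: List[str],
    importances: Sequence[float],
    predict_proba: Callable[[List[str]], float],
    k: int,
    mask_token: str = "[MASK]",
) -> float:
    """Sufficiency-style score: drop in confidence for the original prediction
    when only the top-k most important tokens are kept (lower is better)."""
    # Rank token positions by importance and keep the k highest-ranked ones.
    top_k = set(sorted(range(len(tokens)), key=lambda i: importances[i], reverse=True)[:k])
    # Replace every other token with a mask placeholder (one common "removal" scheme).
    reduced = [tok if i in top_k else mask_token for i, tok in enumerate(tokens)]
    # Confidence in the originally predicted class, before and after reduction.
    return predict_proba(tokens) - predict_proba(reduced)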

Citations

Explanation-based Counterfactual Retraining (XCR): A Calibration Method for Black-box Models
Explanation-based Counterfactual Retraining (XCR) feeds the explanations generated by an XAI model back as counterfactual inputs to retrain the black-box model, addressing OOD and social misalignment problems, and it beats current OOD calibration methods on the OOD calibration metric when calibration on the validation set is applied.
Rethinking Attention-Model Explainability through Faithfulness Violation Test
Attention mechanisms are dominating the explainability of deep models. They produce probability distributions over the input, which are widely deemed as feature-importance indicators. However, …
Quantus: An Explainable AI Toolkit for Responsible Evaluation of Neural Network Explanations
Quantus is a comprehensive, open-source toolkit in Python that includes a growing, well-organised collection of evaluation metrics and tutorials for evaluating explainable methods.
Alignment Rationale for Query-Document Relevance
This paper studies how input perturbations can be used to infer or evaluate alignments between query and document spans that best explain the black-box ranker's relevance prediction.
VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives
It is shown that model FI supervision can meaningfully improve VQA model accuracy as well as performance on several Right-for-the-Right-Reason (RRR) metrics by optimizing for four key model objectives: accurate predictions given limited but sufficient information; max-entropy predictions given no important information; invariance of predictions to changes in unimportant features; and alignment between model FI explanations and human FI explanations.
BASED-XAI: Breaking Ablation Studies Down for Explainable Artificial Intelligence
This work aims to show how varying perturbations and adding simple guardrails can help avoid potentially flawed conclusions, how the treatment of categorical variables is an important consideration in both post-hoc explainability and ablation studies, and how to identify useful baselines for XAI methods and viable perturbation studies.
Learning Unsupervised Hierarchies of Audio Concepts
This paper proposes a method to learn numerous music concepts from audio and then automatically hierarchise them to expose their mutual relationships, and shows that the mined hierarchies are aligned with both ground-truth hierarchies of concepts – when available – and with proxy sources of concept similarity in the general case.
Mediators: Conversational Agents Explaining NLP Model Behavior
Desiderata are established for Mediators, text-based conversational agents capable of explaining the behavior of neural models interactively using natural language, from the perspective of natural language processing research.
Order-sensitive Shapley Values for Evaluating Conceptual Soundness of NLP Models
Order-sensitive Shapley Values (OSV), a new explanation method for sequential data, is proposed; experiments show that OSV is more faithful in explaining model behavior than gradient-based methods and reveal that not all sentiment analysis models learn negation properly.
Necessity and Sufficiency for Explaining Text Classifiers: A Case Study in Hate Speech Detection
A novel feature attribution method for explaining text classifiers is presented, and it is shown that different values of necessity and sufficiency for identity terms correspond to different kinds of false positive errors, exposing sources of classifier bias against marginalized groups.
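The necessity and sufficiency scores mentioned in the entry above follow a similar removal-based recipe. As a complementary, heavily simplified sketch (not the cited paper's estimator), necessity can be approximated by masking only the tokens of interest, such as an identity term, and measuring how much confidence the classifier loses; predict_proba and the mask token are assumptions, as in the earlier example.

from typing import Callable, List

def necessity_score(
    tokens: List[str],
    target_positions: List[int],
    predict_proba: Callable[[List[str]], float],
    mask_token: str = "[MASK]",
) -> float:
    """Necessity-style score: confidence lost when the identified tokens
    (e.g., an identity term) are removed from the input."""
    targets = set(target_positions)
    # Mask only the tokens whose necessity we want to measure.
    ablated = [mask_token if i in targets else tok for i, tok in enumerate(tokens)]
    # A large drop means the prediction depends heavily on those tokens.
    return predict_proba(tokens) - predict_proba(ablated)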
…

References

Showing 1-10 of 76 references
Evaluating Explainable AI: Which Algorithmic Explanations Help Users Predict Model Behavior?
Human subject tests are carried out that are the first of their kind to isolate the effect of algorithmic explanations on a key aspect of model interpretability, simulatability, while avoiding important confounding experimental factors.
Explaining machine learning classifiers through diverse counterfactual explanations
This work proposes a framework for generating and evaluating a diverse set of counterfactual explanations based on determinantal point processes, and provides metrics that enable comparison of counterfactual-based methods to other local explanation methods.
"Why Should I Trust You?": Explaining the Predictions of Any Classifier
LIME is proposed, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner by learning an interpretable model locally around the prediction.
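As a schematic illustration of the local-surrogate idea summarized above (a simplified sketch of the general recipe, not the exact LIME algorithm or the lime package), the function below perturbs a text input by randomly dropping tokens, queries the black-box classifier on the perturbed samples, and fits a proximity-weighted linear model whose coefficients act as per-token importance scores. predict_proba, the kernel width, and the sampling scheme are illustrative choices.

import numpy as np
from sklearn.linear_model import Ridge

def local_linear_explanation(tokens, predict_proba, num_samples=1000, kernel_width=0.25, seed=0):
    """Fit a locally weighted linear surrogate around one prediction.

    tokens: list of input tokens.
    predict_proba: callable mapping a (possibly empty) list of tokens to the
        black-box model's probability for the class of interest.
    Returns (token, coefficient) pairs; larger coefficients mean the token
    pushed the prediction towards that class locally."""
    rng = np.random.default_rng(seed)
    n = len(tokens)
    # Binary masks over token positions: 1 keeps a token, 0 drops it.
    masks = rng.integers(0, 2, size=(num_samples, n))
    masks[0] = 1  # include the unperturbed input as the first sample
    # Query the black-box model on each perturbed input.
    preds = np.array([predict_proba([t for t, m in zip(tokens, row) if m]) for row in masks])
    # Weight samples by proximity to the original input (fraction of tokens kept).
    distances = 1.0 - masks.mean(axis=1)
    weights = np.exp(-(distances ** 2) / kernel_width ** 2)
    # Interpretable surrogate: a linear model over the binary mask features.
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(masks, preds, sample_weight=weights)
    return list(zip(tokens, surrogate.coef_))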
Discretized Integrated Gradients for Explaining Language Models
Discretized Integrated Gradients (DIG) is proposed, which allows effective attribution along non-linear interpolation paths; two interpolation strategies are developed for the discrete word embedding space that generate interpolation points lying close to actual words in the embedding space, yielding more faithful gradient computation.
Have We Learned to Explain?: How Interpretability Methods Can Learn to Encode Predictions in their Interpretations
Eval-X is introduced as a method to quantitatively evaluate interpretations, and REAL-X as an amortized explanation method that learns a predictor model approximating the true data-generating distribution given any subset of the input.
Attention is not not Explanation
It is shown that even when reliable adversarial distributions can be found, they do not perform well on a simple diagnostic, indicating that prior work does not disprove the usefulness of attention mechanisms for explainability.
Interpretable Neural Predictions with Differentiable Binary Variables
This work proposes a latent model that mixes discrete and continuous behaviour, allowing both binary selections and gradient-based training without REINFORCE, and can tractably compute the expected value of penalties such as L0, which allows the model to be optimised directly towards a pre-specified text selection rate.
On Baselines for Local Feature Attributions
This paper shows empirically that baselines can significantly alter the discriminative power of feature attributions, and proposes a new taxonomy of baseline methods for tabular data.
How Do Decisions Emerge across Layers in Neural Models? Interpretation with Differentiable Masking
Differentiable Masking relies on learning sparse stochastic gates to completely mask out subsets of the input while maintaining end-to-end differentiability, and is used to study BERT models on sentiment classification and question answering.
Attention is not Explanation
This work performs extensive experiments across a variety of NLP tasks to assess the degree to which attention weights provide meaningful “explanations” for predictions, and finds that they largely do not.
…