Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions

Xiaochuang Han, Byron C. Wallace, Yulia Tsvetkov
Modern deep learning models for NLP are notoriously opaque. This has motivated the development of methods for interpreting such models, e.g., via gradient-based saliency maps or the visualization of attention weights. Such approaches aim to provide explanations for a particular model prediction by highlighting important words in the corresponding input text. While this might be useful for tasks where decisions are explicitly influenced by individual tokens in the input, we suspect that such… 
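The abstract contrasts instance-level attribution with gradient-based saliency maps. As a point of reference, here is a minimal sketch of "gradient × input" saliency on a hypothetical toy bag-of-embeddings classifier; the tokens, 1-D "embeddings", and weight below are invented for illustration, and real systems use autodiff over a deep network rather than finite differences:

```python
def score(embeds, w):
    # Linear score over summed token embeddings (toy stand-in
    # for a neural text classifier).
    return sum(w * e for e in embeds)

def saliency(tokens, embeds, w, eps=1e-6):
    # Approximate d(score)/d(embedding_i) via finite differences,
    # then multiply by the embedding: the "gradient x input" score.
    out = {}
    for i, (tok, e) in enumerate(zip(tokens, embeds)):
        bumped = embeds[:i] + [e + eps] + embeds[i + 1:]
        grad = (score(bumped, w) - score(embeds, w)) / eps
        out[tok] = abs(grad * e)
    return out

tokens = ["the", "movie", "was", "terrible"]
embeds = [0.1, 0.3, 0.1, -2.0]   # hypothetical 1-D embeddings
sal = saliency(tokens, embeds, 1.5)
top = max(sal, key=sal.get)      # the cue word dominates
```

The sketch highlights individual input tokens, which is exactly the scope the abstract argues may be insufficient for tasks where no single token drives the decision.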


Combining Feature and Instance Attribution to Detect Artifacts

This paper proposes new hybrid approaches that combine saliency maps (which highlight important input features) with instance attribution methods (which retrieve training samples influential to a given prediction) and shows that this proposed training-feature attribution can be used to efficiently uncover artifacts in training data when a challenging validation set is available.

An Empirical Comparison of Instance Attribution Methods for NLP

It is found that simple retrieval methods yield training instances that differ from those identified via gradient-based methods (such as IFs), but that nonetheless exhibit desirable characteristics similar to more complex attribution methods.

Explaining and Improving Model Behavior with k Nearest Neighbor Representations

This work proposes using k nearest neighbor (kNN) representations to identify training examples responsible for a model's predictions and obtains a corpus-level understanding of the model's behavior, and shows that the kNN approach makes the finetuned model more robust to adversarial inputs.
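The kNN idea above can be sketched in a few lines: rank training examples by distance to the test example in representation space and return the nearest ones as the explanation. The tiny 2-D "representations" and labels below are hypothetical stand-ins for a fine-tuned model's hidden states:

```python
import math

def knn_explain(train, test_vec, k=2):
    # Rank training items by Euclidean distance between their learned
    # representations and the test representation; the nearest k serve
    # as instance-level explanations of the prediction.
    ranked = sorted(train, key=lambda item: math.dist(item[1], test_vec))
    return [text for text, _ in ranked[:k]]

train = [
    ("pos: a great film",         [0.9, 0.8]),
    ("pos: thoroughly enjoyable", [0.8, 0.9]),
    ("neg: dull and lifeless",    [-0.9, -0.7]),
]
neighbors = knn_explain(train, [0.9, 0.85], k=2)
```

Because retrieval runs over fixed representations, this is far cheaper than gradient-based influence estimation, which is part of the appeal reported above.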

An Investigation of Language Model Interpretability via Sentence Editing

A sentence editing dataset is re-purposed, from which faithful, high-quality human rationales can be automatically extracted and compared with extracted model rationales, providing a new testbed for a systematic investigation of PLMs’ interpretability.

Dissecting Generation Modes for Abstractive Summarization Models via Ablation and Attribution

This work proposes a two-step method to interpret summarization model decisions and demonstrates its capability to identify phrases the summary model has memorized and determine where in the training pipeline this memorization happened, as well as study complex generation phenomena like sentence fusion on a per-instance basis.

Interpreting Text Classifiers by Learning Context-sensitive Influence of Words

This work proposes MOXIE (MOdeling conteXt-sensitive InfluencE of words), aiming to enable a richer interface for a user to interact with the model being interpreted and to produce testable predictions; MOXIE makes predictions for importance scores, counterfactuals, and learned biases.

Interpreting Deep Learning Models in Natural Language Processing: A Review

This survey provides a comprehensive review of interpretation methods for neural models in NLP, including a high-level taxonomy of interpretation methods in NLP, points out deficiencies of current methods, and suggests avenues for future research.

Pair the Dots: Jointly Examining Training History and Test Stimuli for Model Interpretability

This paper proposes an efficient and differentiable approach to make it feasible to interpret a model's prediction by jointly examining training history and test stimuli, and demonstrates that the proposed methodology offers clear explanations about neural model decisions, along with being useful for performing error analysis, crafting adversarial examples and fixing erroneously classified examples.

On Sample Based Explanation Methods for Sequence-to-Sequence Applications

  • Yun Qin, Fan Zhang
  • Computer Science
    2022 7th International Conference on Computational Intelligence and Applications (ICCIA)
  • 2022
This work proposes a matching influence function, TracInS, for representative sequence-to-sequence applications that require high interpretability for users seeking to understand model behavior, and designs an enhancement based on TracInS that uses arbitrary spans as fine-grained explanation units.

HILDIF: Interactive Debugging of NLI Models Using Influence Functions

A novel explanatory debugging pipeline called HILDIF is proposed, enabling humans to improve deep text classifiers using influence functions as an explanation method, and can effectively alleviate artifact problems in fine-tuned BERT models and result in increased model generalizability.

Towards Explainable NLP: A Generative Explanation Framework for Text Classification

A novel generative explanation framework that learns to make classification decisions and generate fine-grained explanations at the same time and introduces the explainable factor and the minimum risk training approach that learn to generate more reasonable explanations.

Pathologies of Neural Models Make Interpretations Difficult

This work uses input reduction, which iteratively removes the least important word from the input, to expose pathological behaviors of neural models: the remaining words appear nonsensical to humans and are not the ones determined as important by interpretation methods.
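The input-reduction loop described above is simple to state: repeatedly delete the word whose removal hurts model confidence the least, as long as the prediction stays confident. A minimal sketch, with a hypothetical toy "model" that is confident only while a cue word survives:

```python
def input_reduction(tokens, confidence, threshold=0.5):
    # Iteratively remove the word whose deletion reduces confidence
    # the least, stopping when any further deletion would drop the
    # model's confidence below the threshold.
    tokens = list(tokens)
    while len(tokens) > 1:
        candidates = [(confidence(tokens[:i] + tokens[i + 1:]), i)
                      for i in range(len(tokens))]
        best_conf, best_i = max(candidates)
        if best_conf < threshold:
            break
        tokens.pop(best_i)
    return tokens

# Hypothetical toy model: confident iff the cue word is present.
conf = lambda toks: 0.9 if "terrible" in toks else 0.1
reduced = input_reduction(["the", "movie", "was", "terrible"], conf)
```

With a real neural model, the surviving fragment is often nonsensical to humans yet still classified confidently, which is the pathology the paper exposes.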

Learning to Faithfully Rationalize by Construction

Variations of this simple framework yield predictive performance superior to ‘end-to-end’ approaches, while being more general and easier to train.

AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models

This work introduces AllenNLP Interpret, a flexible framework for interpreting NLP models, which provides interpretation primitives for anyAllenNLP model and task, a suite of built-in interpretation methods, and a library of front-end visualization components.

Attention is not Explanation

This work performs extensive experiments across a variety of NLP tasks to assess the degree to which attention weights provide meaningful “explanations” for predictions, and finds that they largely do not.

Visualizing and Understanding Neural Models in NLP

This work describes four strategies for visualizing compositionality in neural models for NLP, inspired by similar work in computer vision, including LSTM-style gates that measure information flow and gradient back-propagation.

Attention is not not Explanation

It is shown that even when reliable adversarial distributions can be found, they don’t perform well on the simple diagnostic, indicating that prior work does not disprove the usefulness of attention mechanisms for explainability.

Topics to Avoid: Demoting Latent Confounds in Text Classification

This work proposes a method that represents latent topical confounds and a model that “unlearns” confounding features by predicting both the label of the input text and the confound; it shows that this model generalizes better and learns features indicative of writing style rather than content.

Learning Important Features Through Propagating Activation Differences

DeepLIFT (Deep Learning Important FeaTures), a method for decomposing the output prediction of a neural network on a specific input by backpropagating the contributions of all neurons in the network to every feature of the input, is presented.
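For a purely linear scorer, DeepLIFT's contribution of each feature reduces to a closed form: the weight times the feature's difference from a reference input, and the contributions sum exactly to the change in output (the "summation-to-delta" property). A minimal sketch of this special case, with invented weights and inputs (real DeepLIFT backpropagates contributions through nonlinear layers):

```python
def deeplift_linear(x, ref, w):
    # For s(x) = sum_i w_i * x_i, the DeepLIFT contribution of
    # feature i relative to a reference input is w_i * (x_i - ref_i);
    # contributions sum exactly to s(x) - s(ref).
    return [wi * (xi - ri) for wi, xi, ri in zip(w, x, ref)]

x   = [1.0, 2.0, 0.0]    # input features (hypothetical)
ref = [0.0, 0.0, 0.0]    # reference ("baseline") input
w   = [0.5, -1.0, 3.0]   # linear weights
contribs = deeplift_linear(x, ref, w)
delta = sum(wi * xi for wi, xi in zip(w, x)) - \
        sum(wi * ri for wi, ri in zip(w, ref))
```

The choice of reference matters: contributions are defined relative to it, not in absolute terms.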

Understanding Black-box Predictions via Influence Functions

This paper uses influence functions — a classic technique from robust statistics — to trace a model's prediction through the learning algorithm and back to its training data, thereby identifying training points most responsible for a given prediction.
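The influence-function formula, I(z_train, z_test) = -∇L(z_test)ᵀ H⁻¹ ∇L(z_train), can be sketched end-to-end for a one-parameter least-squares model y ≈ w·x, where the Hessian is a scalar (a deep network would require Hessian-vector products and approximations instead). The toy dataset below, with one deliberately mislabeled point, is invented for illustration:

```python
def influences(train, test_pt):
    # Fit w at the least-squares optimum for loss 0.5 * (w*x - y)^2.
    w = sum(x * y for x, y in train) / sum(x * x for x, y in train)
    grad = lambda x, y: (w * x - y) * x     # per-example dL/dw
    hess = sum(x * x for x, _ in train)     # d^2(total loss)/dw^2
    xt, yt = test_pt
    # Positive score: upweighting that training point would raise
    # the loss on the test point (a "harmful" example).
    return [-grad(xt, yt) * grad(x, y) / hess for x, y in train]

train = [(1.0, 1.0), (2.0, 2.0), (3.0, -3.0)]   # last point mislabeled
scores = influences(train, (1.0, 1.0))
harmful = max(range(len(train)), key=lambda i: scores[i])
```

The mislabeled point receives the largest positive influence score, which is exactly how the paper's technique surfaces training points most responsible for a prediction.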