Corpus ID: 13193974

Understanding Black-box Predictions via Influence Functions

@article{Koh2017UnderstandingBP,
  title={Understanding Black-box Predictions via Influence Functions},
  author={Pang Wei Koh and Percy Liang},
  journal={ArXiv},
  year={2017},
  volume={abs/1703.04730}
}
How can we explain the predictions of a black-box model? [...] Key Result: On linear models and convolutional neural networks, we demonstrate that influence functions are useful for multiple purposes: understanding model behavior, debugging models, detecting dataset errors, and even creating visually-indistinguishable training-set attacks.
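As a rough sketch of the quantity the paper works with, the snippet below estimates the influence of up-weighting each training point on the loss at one test point for a small L2-regularized logistic regression, forming the Hessian explicitly. The function names and toy setup are illustrative assumptions, not the authors' released code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_loss(theta, x, y, lam):
    """Gradient of the regularized logistic loss at one example (y in {0, 1})."""
    return (sigmoid(x @ theta) - y) * x + lam * theta

def hessian(theta, X, lam):
    """Hessian of the average regularized training loss."""
    p = sigmoid(X @ theta)
    w = p * (1.0 - p)                                   # per-example curvature weights
    return (X * w[:, None]).T @ X / len(X) + lam * np.eye(X.shape[1])

def influence_on_test_loss(theta, X_train, y_train, x_test, y_test, lam=1e-2):
    """-grad L(z_test)^T H^{-1} grad L(z_i) for every training point z_i
    (the up-weighting influence on the test loss)."""
    h_inv_g = np.linalg.solve(hessian(theta, X_train, lam),
                              grad_loss(theta, x_test, y_test, lam))
    return np.array([-grad_loss(theta, x, y, lam) @ h_inv_g
                     for x, y in zip(X_train, y_train)])
```

Training points with large positive scores are those whose up-weighting would most increase the test loss; at the scale of the paper's experiments, the explicit Hessian inverse is replaced with implicit Hessian-vector products.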

Citations of this paper

Explaining Deep Learning Models - A Bayesian Non-parametric Approach
TLDR
The empirical results indicate that the proposed approach not only outperforms the state-of-the-art techniques in explaining individual decisions but also provides users with an ability to discover the vulnerabilities of the target ML models.
Right for Better Reasons: Training Differentiable Models by Constraining their Influence Functions
TLDR
This work demonstrates how to make use of influence functions, a well-known robust statistic, in the constraints to correct the model's behaviour more effectively, and it boosts the quality of explanations at inference time compared to input gradients.
Interpreting Black Box Predictions using Fisher Kernels
TLDR
This work takes a novel look at black box interpretation of test predictions in terms of training examples, making use of Fisher kernels as the defining feature embedding of each data point, combined with Sequential Bayesian Quadrature (SBQ) for efficient selection of examples.
Using Cross-Loss Influence Functions to Explain Deep Network Representations
TLDR
This work provides the first theoretical and empirical demonstration that influence functions can be extended to handle mismatched training and testing settings and enables us to compute the influence of unsupervised and self-supervised training examples with respect to a supervised test objective.
Learning to Abstain via Curve Optimization
TLDR
This work develops a novel approach to the problem of selecting a budget-constrained subset of test examples to abstain on, by analytically optimizing the expected marginal improvement in a desired performance metric, such as the area under the ROC curve or Precision-Recall curve.
Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions
TLDR
It is found that influence functions are particularly useful for natural language inference, a task in which ‘saliency maps’ may not have a clear interpretation, and a new quantitative measure based on influence functions is developed that can reveal artifacts in training data.
Influence functions in Machine Learning tasks
TLDR
This work extends the influence functions framework to cover more Machine Learning tasks, so that they can be used more widely in this field to understand and improve training and performance.
Rethinking Influence Functions of Neural Networks in the Over-parameterized Regime
TLDR
The neural tangent kernel (NTK) theory is utilized to calculate the influence function (IF) for neural networks trained with a regularized mean-square loss, and it is proved that the approximation error can be arbitrarily small when the width is sufficiently large for two-layer ReLU networks.
Transparent Interpretation with Knockouts
TLDR
A new model-agnostic algorithm is proposed to identify a minimum number of training samples that are indispensable for a given model decision at a particular test point, as the model decision would otherwise change upon the removal of these training samples.
Adapting and Evaluating Influence-Estimation Methods for Gradient-Boosted Decision Trees
TLDR
BoostIn is an efficient influence-estimation method for GBDTs that performs as well as or better than existing work while being four orders of magnitude faster.

References

Showing 1-10 of 61 references
Auditing Black-box Models by Obscuring Features
TLDR
A class of techniques originally developed for the detection and repair of disparate impact in classification models can be used to study the sensitivity of any model with respect to any feature subset, without requiring the black-box model to be retrained.
Explaining and Harnessing Adversarial Examples
TLDR
It is argued that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature, supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets.
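The linearity argument summarized above is what motivates the paper's fast gradient sign method (FGSM). The sketch below applies it to a toy logistic model where the input gradient is available in closed form; the model and names are assumptions for illustration.

```python
import numpy as np

def fgsm_perturb(x, y, w, b, eps=0.1):
    """Add eps * sign(grad_x loss) to the input, for a logistic model with
    weights w, bias b, and label y in {0, 1}."""
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))   # predicted probability
    grad_x = (p - y) * w                     # gradient of the logistic loss w.r.t. x
    return x + eps * np.sign(grad_x)
```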
Adam: A Method for Stochastic Optimization
TLDR
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.
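For concreteness, here is a minimal sketch of the Adam update summarized above, with the commonly quoted default hyperparameters; the helper name and explicit state passing are illustrative choices.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: exponential moving averages of the gradient (m) and its
    square (v), bias-corrected for early steps, scale the step per parameter."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)        # bias correction (t counts from 1)
    v_hat = v / (1 - b2 ** t)
    return theta - lr * m_hat / (np.sqrt(v_hat) + eps), m, v
```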
DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks
TLDR
The DeepFool algorithm is proposed to efficiently compute perturbations that fool deep networks, and thus reliably quantify the robustness of these classifiers, and outperforms recent methods in the task of computing adversarial perturbation and making classifiers more robust.
Rethinking the Inception Architecture for Computer Vision
TLDR
This work explores ways to scale up networks that aim to utilize the added computation as efficiently as possible through suitably factorized convolutions and aggressive regularization.
"Why Should I Trust You?": Explaining the Predictions of Any Classifier
TLDR
LIME is proposed, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner by learning an interpretable model locally around the prediction.
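A hedged sketch of the local-surrogate idea behind LIME: sample perturbations around the instance, query the black box, weight samples by proximity, and fit a weighted linear model. This is a simplified continuous variant (the paper fits sparse models over interpretable, e.g. binary, features), and every name below is illustrative.

```python
import numpy as np

def local_surrogate_weights(predict_fn, x, n_samples=500, sigma=1.0, noise=0.1, seed=0):
    """Weighted least-squares surrogate around x; returns per-feature local weights."""
    rng = np.random.default_rng(seed)
    Z = x + noise * rng.standard_normal((n_samples, x.size))         # local perturbations
    y = np.array([predict_fn(z) for z in Z])                         # black-box outputs
    sw = np.sqrt(np.exp(-np.sum((Z - x) ** 2, axis=1) / sigma ** 2)) # proximity weights
    A = np.hstack([Z, np.ones((n_samples, 1))])                      # intercept column
    coef, *_ = np.linalg.lstsq(A * sw[:, None], y * sw, rcond=None)
    return coef[:-1]
```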
Not Just a Black Box: Learning Important Features Through Propagating Activation Differences
TLDR
DeepLIFT (Learning Important FeaTures) is an efficient and effective method for computing importance scores in a neural network that compares the activation of each neuron to its 'reference activation' and assigns contribution scores according to the difference.
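The difference-from-reference idea is easiest to see for a single linear unit, where assigning each input w_i * (x_i - x_ref_i) makes the contributions sum exactly to the change in the unit's pre-activation; the full method propagates such scores through nonlinearities with dedicated rules. The sketch below covers only this linear special case.

```python
import numpy as np

def linear_unit_contributions(w, x, x_ref):
    """Per-input contributions of one linear unit relative to a reference input."""
    return w * (x - x_ref)

# The contributions sum exactly to the change in the pre-activation.
w, x, x_ref = np.array([1.0, -2.0]), np.array([0.5, 0.3]), np.zeros(2)
assert np.isclose(linear_unit_contributions(w, x, x_ref).sum(), w @ x - w @ x_ref)
```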
“Influence sketching”: Finding influential samples in large-scale regressions
TLDR
A new scalable version of Cook's distance, a classical statistical technique for identifying samples which unusually strongly impact the fit of a regression model (and its downstream predictions), is developed, and a new algorithm called “influence sketching” is introduced, which can reliably discover influential samples.
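Since this reference scales up Cook's distance, a brief sketch of the classical quantity for ordinary least squares may help; the paper's "influence sketching" approximation for large-scale regressions is not reproduced here, and the function name is illustrative.

```python
import numpy as np

def cooks_distance(X, y):
    """Classical Cook's distance: how strongly each sample influences the OLS fit."""
    n, p = X.shape
    H = X @ np.linalg.solve(X.T @ X, X.T)   # hat (projection) matrix
    h = np.diag(H)                          # leverages h_ii
    resid = y - H @ y                       # residuals e_i
    s2 = resid @ resid / (n - p)            # residual variance estimate
    return (resid ** 2 / (p * s2)) * h / (1 - h) ** 2
```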
Understanding Neural Networks through Representation Erasure
TLDR
This paper proposes a general methodology to analyze and interpret decisions from a neural model by observing the effects on the model of erasing various parts of the representation, such as input word-vector dimensions, intermediate hidden units, or input words.
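A minimal sketch of the erasure idea, assuming some callable that scores a representation (e.g., a loss or negative log-likelihood); the names are assumptions, not the paper's code.

```python
import numpy as np

def erasure_importance(loss_fn, representation, dims):
    """Importance of each dimension: the increase in loss when it is zeroed out."""
    base = loss_fn(representation)
    scores = []
    for d in dims:
        erased = representation.copy()
        erased[d] = 0.0                      # erase one part of the representation
        scores.append(loss_fn(erased) - base)
    return np.array(scores)
```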
Deep learning via Hessian-free optimization
TLDR
A 2nd-order optimization method based on the "Hessian-free" approach is developed, and applied to training deep auto-encoders, and results superior to those reported by Hinton & Salakhutdinov (2006) are obtained.
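The "Hessian-free" approach uses Hessian-vector products inside conjugate gradient instead of forming the Hessian; a finite-difference approximation is sketched below (an exact version can be obtained with the R-operator or double backpropagation). The same machinery is what makes influence-function estimates tractable for larger models.

```python
import numpy as np

def hessian_vector_product(grad_fn, theta, v, eps=1e-5):
    """Approximate Hv as (grad(theta + eps * v) - grad(theta)) / eps,
    so curvature can be used without ever materializing H."""
    return (grad_fn(theta + eps * v) - grad_fn(theta)) / eps
```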