• Corpus ID: 13193974

Understanding Black-box Predictions via Influence Functions

Pang Wei Koh, Percy Liang
How can we explain the predictions of a black-box model? [...] On linear models and convolutional neural networks, we demonstrate that influence functions are useful for multiple purposes: understanding model behavior, debugging models, detecting dataset errors, and even creating visually indistinguishable training-set attacks.

Influence Functions in Deep Learning Are Fragile

The authors suggest that influence functions in deep learning are fragile in general, and call for improved influence estimation methods to mitigate these issues in non-convex settings.

Explaining Deep Learning Models - A Bayesian Non-parametric Approach

The empirical results indicate that the proposed approach not only outperforms the state-of-the-art techniques in explaining individual decisions but also provides users with an ability to discover the vulnerabilities of the target ML models.

Interpreting Black Box Predictions using Fisher Kernels

This work takes a novel look at black box interpretation of test predictions in terms of training examples, making use of Fisher kernels as the defining feature embedding of each data point, combined with Sequential Bayesian Quadrature (SBQ) for efficient selection of examples.

Using Cross-Loss Influence Functions to Explain Deep Network Representations

This work provides the first theoretical and empirical demonstration that influence functions can be extended to handle mismatched training and testing settings and enables us to compute the influence of unsupervised and self-supervised training examples with respect to a supervised test objective.

Right for Better Reasons: Training Differentiable Models by Constraining their Influence Functions

This paper demonstrates how to use influence functions, a well-known robust statistic, in the constraints to correct the model's behaviour more effectively, and showcases the effectiveness of Right for Better Reasons (RBR) in correcting "Clever Hans"-like behaviour in real, high-dimensional domains.

Learning to Abstain via Curve Optimization

This work develops a novel approach to the problem of selecting a budget-constrained subset of test examples to abstain on, by analytically optimizing the expected marginal improvement in a desired performance metric, such as the area under the ROC curve or Precision-Recall curve.

Minimal Explanations for Neural Network Predictions

This paper proposes a novel approach which can be effectively exploited, either in isolation or in combination with other methods, to enhance the interpretability of neural model predictions and shows that its tractability result extends seamlessly to more advanced neural architectures such as convolutional and graph neural networks.

Explaining Neural Matrix Factorization with Gradient Rollback

It is shown theoretically that the difference between gradient rollback's influence approximation and the true influence on a model's behavior is smaller than known bounds on the stability of stochastic gradient descent, establishing that gradient rollback robustly estimates example influence.

Explaining Black Box Predictions and Unveiling Data Artifacts through Influence Functions

It is found that influence functions are particularly useful for natural language inference, a task in which ‘saliency maps’ may not have a clear interpretation, and a new quantitative measure based on influence functions that can reveal artifacts in training data is developed.

Influence functions in Machine Learning tasks

This work extends the influence functions framework to cover more Machine Learning tasks, so that they can be used more widely in this field to understand and improve training and performance.



Explaining and Harnessing Adversarial Examples

It is argued that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature, supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets.

Auditing Black-box Models by Obscuring Features

A class of techniques originally developed for the detection and repair of disparate impact in classification models can be used to study the sensitivity of any model with respect to any feature subsets, and does not require the black-box model to be retrained.

Adam: A Method for Stochastic Optimization

This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.

DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks

The DeepFool algorithm is proposed to efficiently compute perturbations that fool deep networks, and thus reliably quantify the robustness of these classifiers, and outperforms recent methods in the task of computing adversarial perturbation and making classifiers more robust.

Rethinking the Inception Architecture for Computer Vision

This work explores ways to scale up networks that use the added computation as efficiently as possible, through suitably factorized convolutions and aggressive regularization.

"Why Should I Trust You?": Explaining the Predictions of Any Classifier

LIME is proposed, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction.

“Influence sketching”: Finding influential samples in large-scale regressions

A new scalable version of Cook's distance is developed, a classical statistical technique for identifying samples which unusually strongly impact the fit of a regression model (and its downstream predictions), and a new algorithm which is called “influence sketching” is introduced, which can reliably and successfully discover influential samples.

Understanding Neural Networks through Representation Erasure

This paper proposes a general methodology to analyze and interpret decisions from a neural model by observing the effects on the model of erasing various parts of the representation, such as input word-vector dimensions, intermediate hidden units, or input words.

Deep learning via Hessian-free optimization

A second-order optimization method based on the "Hessian-free" approach is developed and applied to training deep auto-encoders, obtaining results superior to those reported by Hinton & Salakhutdinov (2006).

Auditing black-box models for indirect influence

This paper presents a technique for auditing black-box models, which lets us study the extent to which existing models take advantage of particular features in the data set, without knowing how the models work.