Corpus ID: 16747630

Axiomatic Attribution for Deep Networks

@article{Sundararajan2017AxiomaticAF,
  title={Axiomatic Attribution for Deep Networks},
  author={Mukund Sundararajan and Ankur Taly and Qiqi Yan},
  journal={ArXiv},
  year={2017},
  volume={abs/1703.01365}
}
We study the problem of attributing the prediction of a deep network to its input features, a problem previously studied by several other works. We identify two fundamental axioms, Sensitivity and Implementation Invariance, that attribution methods ought to satisfy. We show that they are not satisfied by most known attribution methods, which we consider to be a fundamental weakness of those methods. We use the axioms to guide the design of a new attribution method called Integrated Gradients. Our method requires no modification to the original network and is extremely simple to implement; it just needs a few calls to the standard gradient operator. We apply this method to a couple of image models, a couple of text models and a chemistry model…
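To make the "few calls to the standard gradient operator" concrete, here is a minimal sketch of Integrated Gradients on a toy differentiable function. The quadratic model f, its hand-written gradient grad_f, and the step count are illustrative assumptions rather than details from the paper; in practice the gradients would come from a framework's automatic differentiation.

```python
import numpy as np

# Toy model and its analytic gradient (stand-ins for a real network + autodiff).
def f(x):
    return float(np.sum(x ** 2))

def grad_f(x):
    return 2.0 * x

def integrated_gradients(x, baseline, grad_fn, steps=50):
    """Riemann-sum approximation of
    IG_i(x) = (x_i - x'_i) * integral_0^1 df(x' + a (x - x')) / dx_i da,
    where x' is the baseline and the path is the straight line from x' to x."""
    alphas = (np.arange(steps) + 0.5) / steps            # midpoints in (0, 1)
    path = baseline + alphas[:, None] * (x - baseline)   # interpolated inputs
    avg_grad = np.mean([grad_fn(p) for p in path], axis=0)
    return (x - baseline) * avg_grad

x = np.array([1.0, -2.0, 3.0])
baseline = np.zeros_like(x)
attr = integrated_gradients(x, baseline, grad_f)
# Completeness check: attributions should sum to f(x) - f(baseline).
print(attr, attr.sum(), f(x) - f(baseline))
```

The final line checks the completeness property: up to the Riemann-sum approximation error, the attributions sum to the difference between the model output at the input and at the baseline.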

Citations

A Rigorous Study of Integrated Gradients Method and Extensions to Internal Neuron Attributions

TLDR
This paper identifies key differences between IG function spaces and the supporting literature’s function spaces, which problematize previous claims of IG uniqueness, and shows that the uniqueness claims can be established with the introduction of an additional axiom, non-decreasing positivity.

A unified view of gradient-based attribution methods for Deep Neural Networks

TLDR
This work analyzes various state-of-the-art attribution methods and proves unexplored connections between them, and performs an empirical evaluation with six attribution methods on a variety of tasks and architectures.

Explaining Explanations: Axiomatic Feature Interactions for Deep Networks

TLDR
This work presents Integrated Hessians, an extension of Integrated Gradients that explains pairwise feature interactions in neural networks and finds that the method is faster than existing methods when the number of features is large, and outperforms previous methods on existing quantitative benchmarks.

Influence Decompositions For Neural Network Attribution

TLDR
A general framework for decomposing the orders of influence that a collection of input variables has on an output classification is proposed, based on the cardinality of input subsets which are perturbed to yield a change in classification.

Explanations for Attributing Deep Neural Network Predictions

TLDR
This chapter introduces Meta-Predictors as Explanations, a principled framework for learning explanations for any black box algorithm, and Meaningful Perturbations, an instantiation of the paradigm applied to the problem of attribution.

Towards better understanding of gradient-based attribution methods for Deep Neural Networks

TLDR
This work analyzes four gradient-based attribution methods and formally proves conditions of equivalence and approximation between them, constructing a unified framework that enables a direct comparison as well as an easier implementation.

Fast Axiomatic Attribution for Neural Networks

TLDR
It is formally proved that nonnegatively homogeneous DNNs (here termed X-DNNs) are efficiently axiomatically attributable, and it is shown that they can be effortlessly constructed from a wide range of regular DNNs by simply removing the bias term of each layer.
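As a rough illustration of the bias-removal construction mentioned in that summary, here is a minimal sketch assuming PyTorch; the helper name strip_biases and the small stand-in MLP are assumptions for illustration, not the paper's models.

```python
import torch
import torch.nn as nn

def strip_biases(model: nn.Module) -> nn.Module:
    """Drop the bias term of every linear/conv layer, yielding a bias-free
    (nonnegatively homogeneous, "X-DNN"-style) network."""
    for module in model.modules():
        if isinstance(module, (nn.Linear, nn.Conv2d)) and module.bias is not None:
            module.bias = None  # forward pass then runs without a bias term
    return model

regular = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
x_dnn = strip_biases(regular)
x = torch.randn(1, 8)
# With ReLU activations and no biases, scaling the input by c >= 0 scales the output by c.
print(x_dnn(2 * x) - 2 * x_dnn(x))  # approximately zero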

Causal Abstractions of Neural Networks

TLDR
It is discovered that a BERT-based model with state-of-the-art performance successfully realizes parts of the natural logic model’s causal structure, whereas a simpler baseline model fails to show any such structure, demonstrating that BERT representations encode the compositional structure of MQNLI.

Inserting Information Bottleneck for Attribution in Transformers

TLDR
This paper applies information bottlenecks to analyze the attribution of each feature for prediction on a black-box model and shows the effectiveness of the method in terms of attribution and the ability to provide insight into how information flows through layers.

Improving Feature Attribution through Input-specific Network Pruning

TLDR
It is shown that input-specific pruning causes network gradients, and hence gradient-based attribution maps, to shift from reflecting local (noisy) importance to reflecting global importance.
...

References

SHOWING 1-10 OF 39 REFERENCES

An unexpected unity among methods for interpreting model predictions

TLDR
This work shows how a model-agnostic additive representation of the importance of input features unifies current methods, and that this representation is optimal in the sense that it is the only set of additive values satisfying certain important properties.

Learning Important Features Through Propagating Activation Differences

TLDR
DeepLIFT (Deep Learning Important FeaTures), a method for decomposing the output prediction of a neural network on a specific input by backpropagating the contributions of all neurons in the network to every feature of the input, is presented.

Visualizing Higher-Layer Features of a Deep Network

TLDR
This paper contrasts and compares several techniques applied to Stacked Denoising Autoencoders and Deep Belief Networks trained on several vision datasets, and shows that good qualitative interpretations of the high-level features represented by such models are possible at the unit level.

"Why Should I Trust You?": Explaining the Predictions of Any Classifier

TLDR
LIME is proposed, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner by learning an interpretable model locally around the prediction.

Model-Agnostic Interpretability of Machine Learning

TLDR
This paper argues for explaining machine learning predictions using model-agnostic approaches, treating the machine learning models as black-box functions, which provide crucial flexibility in the choice of models, explanations, and representations, improving debugging, comparison, and interfaces for a variety of users and models.

Compositional Semantic Parsing on Semi-Structured Tables

TLDR
This paper proposes a logical-form driven parsing algorithm guided by strong typing constraints and shows that it obtains significant improvements over natural baselines; the accompanying dataset is made publicly available.

Explaining and Harnessing Adversarial Examples

TLDR
It is argued that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature, supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets.

Algorithmic Transparency via Quantitative Input Influence: Theory and Experiments with Learning Systems

TLDR
The transparency-privacy tradeoff is explored and it is proved that a number of useful transparency reports can be made differentially private with very little addition of noise.

Evaluating the Visualization of What a Deep Neural Network Has Learned

TLDR
A general methodology based on region perturbation for evaluating ordered collections of pixels such as heatmaps and shows that the recently proposed layer-wise relevance propagation algorithm qualitatively and quantitatively provides a better explanation of what made a DNN arrive at a particular classification decision than the sensitivity-based approach or the deconvolution method.
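A minimal sketch of a region-perturbation evaluation in the spirit of that summary, assuming a grayscale image with values in [0, 1], non-overlapping square regions, and uniform-noise replacement; the scoring model and all parameter choices are placeholders, not the paper's exact protocol.

```python
import numpy as np

def perturbation_curve(image, attribution, model_score, region=4, steps=20, seed=0):
    """Perturb the most-relevant regions first and record the model score;
    a steeper decline indicates a better-ordered heatmap."""
    rng = np.random.default_rng(seed)
    h, w = attribution.shape
    # Rank non-overlapping region x region patches by their summed attribution.
    patches = sorted(
        ((attribution[i:i + region, j:j + region].sum(), i, j)
         for i in range(0, h, region) for j in range(0, w, region)),
        reverse=True,
    )
    perturbed = image.copy()
    scores = [model_score(perturbed)]
    for _, i, j in patches[:steps]:
        block = perturbed[i:i + region, j:j + region]
        block[...] = rng.uniform(size=block.shape)  # replace region with noise
        scores.append(model_score(perturbed))
    return np.array(scores)
```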

Building high-level features using large scale unsupervised learning

TLDR
Contrary to what appears to be a widely-held intuition, the experimental results reveal that it is possible to train a face detector without having to label images as containing a face or not.