• Corpus ID: 168169826

Analyzing the Interpretability Robustness of Self-Explaining Models

Haizhong Zheng, Earlence Fernandes, Atul Prakash
Recently, interpretable models called self-explaining models (SEMs) have been proposed with the goal of providing interpretability robustness. We evaluate the interpretability robustness of SEMs and show that explanations provided by SEMs as currently proposed are not robust to adversarial inputs. Specifically, we successfully created adversarial inputs that do not change the model outputs but cause significant changes in the explanations. We find that even though current SEMs use stable co… 
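The kind of attack described above can be sketched in miniature. The following is an illustration, not the authors' actual method: a tiny ReLU network with random weights stands in for a trained self-explaining model, finite-difference gradient saliency stands in for its explanation, and a simple random search looks for a perturbation that preserves the predicted class while maximizing the change in the explanation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny two-layer ReLU network: a hypothetical stand-in for a trained SEM.
W1 = rng.normal(size=(8, 4)); b1 = rng.normal(size=8)
W2 = rng.normal(size=(3, 8)); b2 = rng.normal(size=3)

def logits(x):
    h = np.maximum(W1 @ x + b1, 0.0)
    return W2 @ h + b2

def saliency(x, eps=1e-4):
    """Finite-difference gradient of the top logit w.r.t. the input."""
    c = int(np.argmax(logits(x)))
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x); d[i] = eps
        g[i] = (logits(x + d)[c] - logits(x - d)[c]) / (2 * eps)
    return g

def explanation_attack(x, radius=0.5, n_trials=500):
    """Random search: keep the predicted class, maximize the saliency change."""
    base_class = int(np.argmax(logits(x)))
    base_sal = saliency(x)
    best, best_gap = x, 0.0
    for _ in range(n_trials):
        x_adv = x + rng.uniform(-radius, radius, size=x.shape)
        if int(np.argmax(logits(x_adv))) != base_class:
            continue  # prediction changed, so this is not a valid attack
        gap = np.linalg.norm(saliency(x_adv) - base_sal)
        if gap > best_gap:
            best, best_gap = x_adv, gap
    return best, best_gap

x0 = rng.normal(size=4)
x_adv, gap = explanation_attack(x0)
```

A successful run returns an input with the same predicted class but a maximally different saliency map; stronger attacks would use gradient ascent on the explanation difference rather than random search.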


When and How to Fool Explainable Models (and Humans) with Adversarial Examples

This paper proposes a comprehensive framework to study whether (and how) adversarial examples can be generated for explainable models under human assessment, introducing novel attack paradigms.

This Looks More Like That: Enhancing Self-Explaining Models by Prototypical Relevance Propagation

This work provides a detailed case study of the self-explaining network, ProtoPNet, in the presence of a spectrum of artifacts, and introduces Prototypical Relevance Propagation (PRP), a novel method for generating more precise model-aware explanations.

Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks

In this survey, the literature on techniques for interpreting the inner components of DNNs, known as inner interpretability methods, is reviewed, with a focus on how these techniques relate to the goal of designing safer, more trustworthy AI systems.

Certified Interpretability Robustness for Class Activation Mapping

Interpreting machine learning models is challenging but crucial for ensuring the safety of deep networks in autonomous driving systems. Due to the prevalence of deep-learning-based perception models…

CXAI: Explaining Convolutional Neural Networks for Medical Imaging Diagnostic

Two major directions for explaining convolutional neural networks are investigated: feature-based post hoc explanatory methods that try to explain already trained and fixed target models, and preliminary analysis and choice of the model architecture, which achieves an accuracy of 98% ± 0.156% among 36 CNN architectures with different configurations.

Measuring Association Between Labels and Free-Text Rationales

It is demonstrated that *pipelines*, models for faithful rationalization on information-extraction-style tasks, do not work as well on “reasoning” tasks requiring free-text rationales, while state-of-the-art T5-based joint models exhibit desirable properties for explaining commonsense question answering and natural language inference.



Towards Robust Interpretability with Self-Explaining Neural Networks

This work designs self-explaining models in stages, progressively generalizing linear classifiers to complex yet architecturally explicit models, and proposes three desiderata for explanations in general – explicitness, faithfulness, and stability.
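The "architecturally explicit" form this work arrives at can be sketched as a generalized linear model whose coefficients depend on the input, so the prediction carries its own explanation. Everything below (the coefficient network, identity concepts) is a hypothetical miniature, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical coefficient network: theta(x) = tanh(A x).
A = rng.normal(size=(4, 4))

def concepts(x):
    return x  # simplest concept map h(x) = x (raw features as concepts)

def theta(x):
    return np.tanh(A @ x)  # input-dependent coefficients

def senn_forward(x):
    """Self-explaining form f(x) = theta(x) . h(x): the coefficient vector
    theta(x) doubles as the explanation of the prediction."""
    t = theta(x)
    return float(t @ concepts(x)), t

x = rng.normal(size=4)
y, expl = senn_forward(x)
```

The robustness desideratum then asks that `theta(x)` vary slowly with `x`, which is exactly what the attacks on SEMs above probe.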

On the Robustness of Interpretability Methods

We argue that robustness of explanations---i.e., that similar inputs should give rise to similar explanations---is a key desideratum for interpretability. We introduce metrics to quantify robustness
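One concrete way to quantify this notion (a sketch under assumptions, not necessarily the paper's exact metric) is an empirical local-Lipschitz estimate of the explanation map: the largest ratio of explanation change to input change over sampled neighbors. The model and explanation function below are toy stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy model f(x) = sum(tanh(x)); its gradient explanation is 1 - tanh(x)^2.
def explanation(x):
    return 1.0 - np.tanh(x) ** 2

def local_lipschitz(x, radius=0.1, n_samples=200):
    """Empirical local-Lipschitz estimate of an explanation map:
    max over sampled x' near x of ||e(x') - e(x)|| / ||x' - x||."""
    e0 = explanation(x)
    worst = 0.0
    for _ in range(n_samples):
        d = rng.uniform(-radius, radius, size=x.shape)
        denom = np.linalg.norm(d)
        if denom == 0.0:
            continue
        worst = max(worst, np.linalg.norm(explanation(x + d) - e0) / denom)
    return worst

L = local_lipschitz(np.zeros(5))
```

A small value of `L` means nearby inputs get nearby explanations; large values flag the instability that adversarial explanation attacks exploit.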

Interpretation of Neural Networks is Fragile

This paper systematically characterizes the fragility of several widely used feature-importance interpretation methods (saliency maps, relevance propagation, and DeepLIFT) on ImageNet and CIFAR-10, and extends these results to show that interpretations based on exemplars (e.g. influence functions) are similarly fragile.

"Why Should I Trust You?": Explaining the Predictions of Any Classifier

LIME is proposed, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner by learning an interpretable model locally around the prediction.
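LIME's core loop is easy to sketch: sample neighbors of the instance, query the black box, weight the samples by proximity, and fit a weighted linear surrogate whose coefficients serve as the explanation. The black-box model, sampling scale, and kernel width below are illustrative choices, not LIME's defaults.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical black box to explain: a fixed logistic model.
true_w = np.array([2.0, -1.0, 0.0, 0.5])
def predict_proba(X):
    return 1.0 / (1.0 + np.exp(-(X @ true_w)))

def lime_explain(x, n_samples=2000, sigma=1.0):
    """Minimal LIME-style sketch: perturb x, weight samples by an
    exponential proximity kernel, fit a weighted linear surrogate."""
    X = x + rng.normal(scale=0.5, size=(n_samples, x.size))
    y = predict_proba(X)
    w = np.exp(-np.sum((X - x) ** 2, axis=1) / (2 * sigma ** 2))
    A = np.hstack([X, np.ones((n_samples, 1))])  # add an intercept column
    # Weighted least squares via the normal equations.
    coef = np.linalg.solve(A.T @ (A * w[:, None]), A.T @ (y * w))
    return coef[:-1]  # per-feature local importance (drop the intercept)

expl = lime_explain(np.zeros(4))
```

The signs of the surrogate coefficients recover the local influence of each feature: positive for feature 0, negative for feature 1, near zero for the irrelevant feature 2.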

Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead

  • C. Rudin
  • Computer Science
    Nat. Mach. Intell.
  • 2019
This Perspective clarifies the chasm between explaining black boxes and using inherently interpretable models, outlines several key reasons why explainable black boxes should be avoided in high-stakes decisions, identifies challenges to interpretable machine learning, and provides several example applications where interpretable models could potentially replace black box models in criminal justice, healthcare, and computer vision.

Axiomatic Attribution for Deep Networks

We study the problem of attributing the prediction of a deep network to its input features, a problem previously studied by several other works. We identify two fundamental axioms— Sensitivity and
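The attribution method this work proposes, integrated gradients, averages the gradient along the straight path from a baseline to the input: IG_i(x) = (x_i − x′_i) ∫₀¹ ∂F/∂x_i(x′ + α(x − x′)) dα. A small numerical sketch follows; the toy function, baseline, and step count are illustrative choices.

```python
import numpy as np

def f(x):
    # Toy differentiable model: f(x) = x0^2 + 3*x1.
    return x[0] ** 2 + 3.0 * x[1]

def grad(x, eps=1e-5):
    """Central finite-difference gradient of f."""
    g = np.zeros_like(x)
    for i in range(x.size):
        d = np.zeros_like(x); d[i] = eps
        g[i] = (f(x + d) - f(x - d)) / (2 * eps)
    return g

def integrated_gradients(x, baseline, steps=100):
    """IG_i = (x_i - b_i) * integral_0^1 dF/dx_i(b + a(x - b)) da,
    approximated with a midpoint Riemann sum."""
    alphas = (np.arange(steps) + 0.5) / steps
    total = np.zeros_like(x)
    for a in alphas:
        total += grad(baseline + a * (x - baseline))
    return (x - baseline) * total / steps

x = np.array([1.0, 2.0])
ig = integrated_gradients(x, np.zeros_like(x))
```

The attributions satisfy completeness: they sum to F(x) − F(baseline) (here 1 + 6 = 7), which is the property the axioms single out.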

This Looks Like That: Deep Learning for Interpretable Image Recognition

A deep network architecture, the prototypical part network (ProtoPNet), that reasons in a way similar to how ornithologists, physicians, and others would explain to people how to solve challenging image classification tasks, providing a level of interpretability that is absent in other interpretable deep models.

Deep Learning for Case-based Reasoning through Prototypes: A Neural Network that Explains its Predictions

This work creates a novel network architecture for deep learning that naturally explains its own reasoning for each prediction, and the explanations are loyal to what the network actually computes.
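The prototype-based reasoning shared by these two architectures can be sketched: classify a latent representation by its similarity to learned per-class prototypes, so the nearest prototype is itself the explanation ("this looks like that"). The latent dimension, prototype count, and random prototypes below are hypothetical stand-ins for learned components; the similarity score follows the ProtoPNet-style log ratio of distances.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical learned prototypes: 3 classes, 2 prototypes each, 5-d latent space.
prototypes = rng.normal(size=(3, 2, 5))

def prototype_logits(z, eps=1e-6):
    """Score each class by similarity to its nearest prototype,
    using sim = log((d + 1) / (d + eps)) on squared distances d."""
    d = np.linalg.norm(prototypes - z, axis=-1) ** 2  # (classes, protos)
    sim = np.log((d + 1.0) / (d + eps))
    return sim.max(axis=1)  # best prototype per class

# A latent point close to a class-1 prototype should be classified as class 1,
# and the nearby prototype is the explanation of that decision.
z = prototypes[1, 0] + 0.01 * rng.normal(size=5)
pred = int(np.argmax(prototype_logits(z)))
```

Note how small perturbations of `z` change which prototype is nearest only when `z` sits between prototypes, which is where explanation attacks on such models concentrate.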

Towards Deep Learning Models Resistant to Adversarial Attacks

This work studies the adversarial robustness of neural networks through the lens of robust optimization, and suggests the notion of security against a first-order adversary as a natural and broad security guarantee.
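The first-order adversary at the center of this robust-optimization view is the PGD attack: repeatedly step in the sign of the loss gradient and project back into an L∞ ball around the original input. A minimal sketch on a toy loss (the loss surface, radius, and step size are illustrative):

```python
import numpy as np

# Toy loss surface standing in for a network's loss on one example.
def loss(x):
    return float(np.sum((x - 2.0) ** 2))

def loss_grad(x):
    return 2.0 * (x - 2.0)

def pgd(x0, eps=0.5, step=0.1, iters=20):
    """Projected gradient ascent on the loss inside an L-inf ball:
    x <- clip(x + step * sign(grad), x0 - eps, x0 + eps)."""
    x = x0.copy()
    for _ in range(iters):
        x = x + step * np.sign(loss_grad(x))
        x = np.clip(x, x0 - eps, x0 + eps)  # project back into the ball
    return x

x0 = np.zeros(3)
x_adv = pgd(x0)
```

Adversarial training in this framework simply trains on `x_adv` in place of `x0`, minimizing the worst-case loss the inner maximization finds.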

Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps

This paper addresses the visualisation of image classification models, learnt using deep Convolutional Networks (ConvNets), and establishes the connection between the gradient-based ConvNet visualisation methods and deconvolutional networks.