Corpus ID: 168169826

Analyzing the Interpretability Robustness of Self-Explaining Models

@article{Zheng2019AnalyzingTI,
  title={Analyzing the Interpretability Robustness of Self-Explaining Models},
  author={Haizhong Zheng and Earlence Fernandes and Atul Prakash},
  journal={ArXiv},
  year={2019},
  volume={abs/1905.12429}
}
Recently, interpretable models called self-explaining models (SEMs) have been proposed with the goal of providing interpretability robustness. We evaluate the interpretability robustness of SEMs and show that explanations provided by SEMs as currently proposed are not robust to adversarial inputs. Specifically, we successfully created adversarial inputs that do not change the model outputs but cause significant changes in the explanations. We find that even though current SEMs use stable co… 
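As a concrete illustration of the kind of attack described in the abstract, the following is a minimal sketch (not the authors' released code) of a projected-gradient search for an input that keeps a self-explaining model's logits essentially fixed while pushing its explanation away from the original. It assumes a PyTorch model whose forward pass returns (logits, explanation); the names model, x, epsilon, and the loss weighting are illustrative assumptions.

import torch

def explanation_attack(model, x, epsilon=0.05, step_size=0.01, iters=40, lam=10.0):
    """Search an L-inf ball around x for an input whose prediction is
    unchanged but whose explanation drifts as far as possible."""
    with torch.no_grad():
        logits0, expl0 = model(x)            # reference prediction and explanation
    y0 = logits0.argmax(dim=1)
    x_adv = x.clone().detach()
    for _ in range(iters):
        x_adv.requires_grad_(True)
        logits, expl = model(x_adv)
        # maximize explanation change, penalize any drift in the logits
        loss = (expl - expl0).norm() - lam * (logits - logits0).norm()
        grad, = torch.autograd.grad(loss, x_adv)
        with torch.no_grad():
            x_adv = x_adv + step_size * grad.sign()               # ascent step
            x_adv = x + (x_adv - x).clamp(-epsilon, epsilon)      # project onto the L-inf ball
            x_adv = x_adv.clamp(0.0, 1.0)                         # stay a valid image
    with torch.no_grad():
        logits_adv, _ = model(x_adv)
    # keep the example only if the predicted class really did not change
    return x_adv if bool((logits_adv.argmax(dim=1) == y0).all()) else None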

Citations

When and How to Fool Explainable Models (and Humans) with Adversarial Examples
TLDR
This paper proposes a comprehensive framework to study whether (and how) adversarial examples can be generated for explainable models under human assessment, introducing novel attack paradigms.
This looks more like that: Enhancing Self-Explaining Models by Prototypical Relevance Propagation
TLDR
This work provides a detailed case study of the self-explaining network, ProtoPNet, in the presence of a spectrum of artifacts, and introduces Prototypical Relevance Propagation (PRP), a novel method for generating more precise model-aware explanations.
Measuring Association Between Labels and Free-Text Rationales
TLDR
It is demonstrated that *pipelines*, models for faithful rationalization on information-extraction style tasks, do not work as well on “reasoning” tasks requiring free-text rationales, and state-of-the-art T5-based joint models exhibit desirable properties for explaining commonsense question-answering and natural language inference.
Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks
TLDR
This survey reviews the literature on techniques for interpreting the inner components of DNNs, known as inner interpretability methods, with a focus on how these techniques relate to the goal of designing safer, more trustworthy AI systems.
CXAI: Explaining Convolutional Neural Networks for Medical Imaging Diagnostic
TLDR
Two major directions for explaining convolutional neural networks are investigated: feature-based post hoc explanatory methods that try to explain already trained and fixed target models, and preliminary analysis and choice of the model architecture, selected from 36 CNN architectures with different configurations and reaching an accuracy of 98% ± 0.156%.
Certified Interpretability Robustness for Class Activation Mapping
Interpreting machine learning models is challenging but crucial for ensuring the safety of deep networks in autonomous driving systems. Due to the prevalence of deep learning-based perception models…

References

SHOWING 1-10 OF 11 REFERENCES
Towards Robust Interpretability with Self-Explaining Neural Networks
TLDR
This work designs self-explaining models in stages, progressively generalizing linear classifiers to complex yet architecturally explicit models, and proposes three desiderata for explanations in general – explicitness, faithfulness, and stability.
On the Robustness of Interpretability Methods
We argue that robustness of explanations, i.e., that similar inputs should give rise to similar explanations, is a key desideratum for interpretability. We introduce metrics to quantify robustness…
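To make the notion concrete, here is a minimal sampling-based sketch of such a robustness measure, a local-Lipschitz-style estimate of how much an explanation can change within a small neighborhood of the input. The names explain_fn, x, and the sampling radius are assumptions for illustration, not the paper's exact estimator.

import torch

def local_lipschitz_estimate(explain_fn, x, radius=0.05, n_samples=100):
    """Estimate max ||e(x') - e(x)|| / ||x' - x|| over random x' near x."""
    e_x = explain_fn(x)
    worst = 0.0
    for _ in range(n_samples):
        delta = torch.empty_like(x).uniform_(-radius, radius)   # random neighbor
        x_prime = (x + delta).clamp(0.0, 1.0)
        ratio = (explain_fn(x_prime) - e_x).norm() / (x_prime - x).norm()
        worst = max(worst, ratio.item())
    return worst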
Interpretation of Neural Networks is Fragile
TLDR
This paper systematically characterizes the fragility of several widely-used feature-importance interpretation methods (saliency maps, relevance propagation, and DeepLIFT) on ImageNet and CIFAR-10, and extends these results to show that interpretations based on exemplars (e.g. influence functions) are similarly fragile.
"Why Should I Trust You?": Explaining the Predictions of Any Classifier
TLDR
LIME is proposed, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction.
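The local-surrogate idea can be sketched in a few lines: perturb the instance, query the black-box model, weight the samples by proximity, and read the explanation off a weighted linear fit. The sketch below assumes a tabular binary classifier exposing predict_proba; the function names and kernel choice are illustrative, not the LIME library's API.

import numpy as np
from sklearn.linear_model import Ridge

def local_linear_explanation(predict_proba, x, n_samples=500, scale=0.1, kernel_width=0.75):
    """Fit a weighted linear surrogate around x; its coefficients are the explanation."""
    x = np.asarray(x, dtype=float)
    X = x + np.random.normal(0.0, scale, size=(n_samples, x.shape[0]))   # perturbed neighbors
    y = predict_proba(X)[:, 1]                                           # black-box probability of class 1
    dist = np.linalg.norm(X - x, axis=1)
    weights = np.exp(-(dist ** 2) / kernel_width ** 2)                   # proximity kernel
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(X, y, sample_weight=weights)
    return surrogate.coef_                                               # per-feature local importance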
Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead
  C. Rudin, Nat. Mach. Intell., 2019
TLDR
This Perspective clarifies the chasm between explaining black boxes and using inherently interpretable models, outlines several key reasons why explainable black boxes should be avoided in high-stakes decisions, identifies challenges to interpretable machine learning, and provides several example applications where interpretable models could potentially replace black box models in criminal justice, healthcare and computer vision.
Axiomatic Attribution for Deep Networks
We study the problem of attributing the prediction of a deep network to its input features, a problem previously studied by several other works. We identify two fundamental axioms, Sensitivity and…
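A minimal sketch of the resulting path-attribution method (integrated gradients): average the gradients of the target-class score along the straight line from a baseline to the input, then scale by the input-baseline difference. The names model, x, and the all-zeros baseline are assumptions for illustration.

import torch

def integrated_gradients(model, x, target_class, baseline=None, steps=50):
    """Approximate the path integral of gradients with a Riemann sum."""
    baseline = torch.zeros_like(x) if baseline is None else baseline
    total_grad = torch.zeros_like(x)
    for k in range(1, steps + 1):
        point = (baseline + (k / steps) * (x - baseline)).detach().requires_grad_(True)
        score = model(point)[:, target_class].sum()         # logit of the class being explained
        total_grad += torch.autograd.grad(score, point)[0]
    return (x - baseline) * total_grad / steps               # attribution per input feature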
Deep Learning for Case-based Reasoning through Prototypes: A Neural Network that Explains its Predictions
TLDR
This work creates a novel network architecture for deep learning that naturally explains its own reasoning for each prediction, and the explanations are loyal to what the network actually computes.
Layer-Wise Relevance Propagation for Neural Networks with Local Renormalization Layers
TLDR
This paper proposes an approach to extend layer-wise relevance propagation to neural networks with local renormalization layers, which is a very common product-type non-linearity in convolutional neural networks.
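For reference, the basic epsilon rule that such extensions build on can be sketched for a single fully connected layer as below; this is the generic rule, not the paper's treatment of local renormalization layers, and the tensor names and shapes are assumptions.

import torch

def lrp_epsilon_linear(activations, weight, bias, relevance_out, eps=1e-6):
    """Redistribute relevance from a linear layer's outputs to its inputs.
    activations: (batch, in), weight: (out, in), bias: (out,), relevance_out: (batch, out)."""
    z = activations @ weight.t() + bias                                           # pre-activations
    z = z + eps * torch.where(z >= 0, torch.ones_like(z), -torch.ones_like(z))    # stabilize the division
    s = relevance_out / z                                                         # per-output message
    return activations * (s @ weight)                                             # relevance per input, (batch, in)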
Towards Deep Learning Models Resistant to Adversarial Attacks
TLDR
This work studies the adversarial robustness of neural networks through the lens of robust optimization, and suggests the notion of security against a first-order adversary as a natural and broad security guarantee.
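In standard notation, the saddle-point objective studied in this line of work (with model parameters \theta, data distribution \mathcal{D}, loss L, and an \ell_\infty perturbation budget \epsilon) can be written as

\min_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}} \left[ \max_{\|\delta\|_{\infty} \le \epsilon} L(\theta,\, x+\delta,\, y) \right]

with the inner maximization approximated by a first-order method such as projected gradient descent.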
This looks like that: deep learning for interpretable image recognition
TLDR
A deep network architecture, the prototypical part network (ProtoPNet), that reasons in a way similar to how ornithologists, physicians, and others would explain to people how to solve challenging image classification tasks, and that provides a level of interpretability absent in other interpretable deep models.
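The core computation can be sketched as a prototype-similarity layer: every spatial patch of a convolutional feature map is compared to learned prototype vectors, and the closest match per prototype is turned into a similarity score. The sketch below assumes 1x1 prototypes and is illustrative, not the released ProtoPNet code.

import torch

def prototype_similarity(features, prototypes, eps=1e-4):
    """features: (batch, C, H, W); prototypes: (P, C). Returns (batch, P) similarity scores."""
    b, c, h, w = features.shape
    patches = features.permute(0, 2, 3, 1).reshape(b, h * w, c)          # one C-dim vector per location
    protos = prototypes.unsqueeze(0).expand(b, -1, -1).contiguous()      # broadcast prototypes over the batch
    min_dist = torch.cdist(patches, protos).min(dim=1).values            # best-matching patch per prototype
    return torch.log((min_dist + 1.0) / (min_dist + eps))                # distance -> similarity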