Corpus ID: 168169826

Analyzing the Interpretability Robustness of Self-Explaining Models

@article{Zheng2019AnalyzingTI,
  title={Analyzing the Interpretability Robustness of Self-Explaining Models},
  author={Haizhong Zheng and E. Fernandes and A. Prakash},
  journal={ArXiv},
  year={2019},
  volume={abs/1905.12429}
}
  • Haizhong Zheng, E. Fernandes, A. Prakash
  • Published 2019
  • Computer Science, Mathematics
  • ArXiv
Abstract: Recently, interpretable models called self-explaining models (SEMs) have been proposed with the goal of providing interpretability robustness. We evaluate the interpretability robustness of SEMs and show that explanations provided by SEMs as currently proposed are not robust to adversarial inputs. Specifically, we successfully created adversarial inputs that do not change the model outputs but cause significant changes in the explanations. We find that even though current SEMs use stable co…
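
To make the attack setting described above concrete, the following is a minimal, self-contained sketch (not the authors' code) of the general idea: search for a small input perturbation that leaves the predicted class unchanged while substantially changing the model's explanation. The toy SENN-style model, the PGD-style optimization, and all hyperparameters below are illustrative assumptions, not the paper's exact formulation.

# Illustrative sketch of an "explanation attack" on a toy self-explaining model.
# Assumptions (not from the paper): a SENN-style model whose prediction is
# theta(x) . h(x), where h(x) is treated as the explanation, and a PGD-style
# search that maximizes explanation change while penalizing prediction drift.
import torch
import torch.nn as nn


class ToySelfExplainingModel(nn.Module):
    """Toy self-explaining model: logits_c = sum_k theta_{c,k}(x) * h_k(x)."""

    def __init__(self, in_dim=20, n_concepts=5, n_classes=3):
        super().__init__()
        self.concepts = nn.Sequential(nn.Linear(in_dim, n_concepts), nn.Tanh())
        self.relevances = nn.Linear(in_dim, n_concepts * n_classes)
        self.n_concepts, self.n_classes = n_concepts, n_classes

    def forward(self, x):
        h = self.concepts(x)  # (B, K) concept values, treated as the explanation
        theta = self.relevances(x).view(-1, self.n_classes, self.n_concepts)
        logits = torch.einsum("bck,bk->bc", theta, h)  # (B, C) class scores
        return logits, h


def explanation_attack(model, x, eps=0.3, alpha=0.02, steps=100):
    """Find x' with ||x' - x||_inf <= eps that keeps the predicted class
    (approximately) fixed but maximizes the change in the explanation."""
    model.eval()
    with torch.no_grad():
        clean_logits, clean_expl = model(x)
        clean_pred = clean_logits.argmax(dim=1)

    delta = torch.zeros_like(x, requires_grad=True)
    for _ in range(steps):
        logits, expl = model(x + delta)
        expl_change = (expl - clean_expl).pow(2).sum()  # reward: explanation moves
        pred_drift = nn.functional.cross_entropy(logits, clean_pred)  # penalty: prediction moves
        loss = expl_change - 10.0 * pred_drift
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()  # gradient-ascent PGD step
            delta.clamp_(-eps, eps)             # stay inside the L_inf ball
            delta.grad.zero_()
    return (x + delta).detach()


if __name__ == "__main__":
    torch.manual_seed(0)
    model = ToySelfExplainingModel()
    x = torch.randn(4, 20)
    x_adv = explanation_attack(model, x)
    with torch.no_grad():
        logits0, expl0 = model(x)
        logits1, expl1 = model(x_adv)
    print("predictions unchanged:", bool((logits0.argmax(1) == logits1.argmax(1)).all()))
    print("mean explanation shift:", (expl1 - expl0).norm(dim=1).mean().item())

A successful search of this kind reports unchanged predictions together with a large explanation shift, which is the failure mode the abstract describes.
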
1 Citation

  • Measuring Association Between Labels and Free-Text Rationales
