Toward Transparent AI: A Survey on Interpreting the Inner Structures of Deep Neural Networks

Tilman Räuker, Anson Ho, Stephen Casper, Dylan Hadfield-Menell
The last decade of machine learning has seen drastic increases in scale and capabilities, and deep neural networks (DNNs) are increasingly being deployed across a wide range of domains. However, the inner workings of DNNs are generally difficult to understand, raising concerns about the safety of using these systems without a rigorous understanding of how they function. In this survey, we review literature on techniques for interpreting the inner components of DNNs, which we call inner…


Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

This investigation is the largest end-to-end attempt at reverse-engineering a natural behavior “in the wild” in a language model, and provides evidence that a mechanistic understanding of large ML models is feasible, pointing toward opportunities to scale the understanding to both larger models and more complex tasks.

Robust Feature-Level Adversaries are Interpretability Tools

The results indicate that feature-level attacks are a promising approach for rigorous interpretability research and support the design of tools to better understand what a model has learned and diagnose brittle feature associations.

An Explainable Self-Labeling Grey-Box Model

This work studies an explanation approach called the Grey-Box model, which uses a self-labeling framework based on a semi-supervised methodology to exploit the benefits of both black-box and white-box models.

Diagnostics for Deep Neural Networks with Automated Copy/Paste Attacks

A method termed Search for Natural Adversarial Features Using Embeddings (SNAFUE) is introduced which offers a fully-automated method for "copy/paste" attacks in which one natural image can be pasted into another in order to induce an unrelated misclassification.

Towards Interpretable Deep Neural Networks by Leveraging Adversarial Examples

This work aims to increase the interpretability of DNNs across the whole image space by reducing the ambiguity of neurons, and proposes a metric to quantitatively evaluate the consistency level of neurons in a network.

Explaining Deep Neural Networks and Beyond: A Review of Methods and Applications

This work aims to provide a timely overview of the active emerging field of XAI, with a focus on “post hoc” explanations; it explains their theoretical foundations and puts interpretability algorithms to the test from both a theoretical and a comparative-evaluation perspective.

An Empirical Study on the Relation Between Network Interpretability and Adversarial Robustness

It is demonstrated that training the networks to have interpretable gradients improves their robustness to adversarial perturbations, and the results indicate that the interpretability of the model gradients is a crucial factor for adversarial robustness.

Improving the Adversarial Robustness and Interpretability of Deep Neural Networks by Regularizing their Input Gradients

It is demonstrated that regularizing input gradients makes them more naturally interpretable as rationales for model predictions, and also improves robustness to transferred adversarial examples generated to fool all of the other models.
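
The idea of penalizing input gradients can be written in closed form for a simple logistic model, where the gradient of the cross-entropy loss with respect to the input is (p − y)·w. The sketch below is illustrative only — the function name and the regularization strength `lam` are assumptions, not the paper's actual implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_with_grad_penalty(w, x, y, lam=0.1):
    """Cross-entropy loss plus an input-gradient penalty (double-backprop style).

    For logistic regression p = sigmoid(w @ x), the input gradient of the
    cross-entropy loss is (p - y) * w, so the penalty has a closed form.
    `lam` is a hypothetical regularization strength.
    """
    p = sigmoid(w @ x)
    ce = -(y * np.log(p) + (1 - y) * np.log(1 - p))
    input_grad = (p - y) * w          # dL/dx in closed form
    penalty = lam * np.sum(input_grad ** 2)
    return ce + penalty, input_grad
```

Minimizing this combined loss pushes the model toward predictions whose sensitivity to small input perturbations is low, which is the mechanism the paper connects to both interpretability and transfer robustness.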

Understanding the role of individual units in a deep neural network

This work presents network dissection, an analytic framework to systematically identify the semantics of individual hidden units within image classification and image generation networks, and applies it to understanding adversarial attacks and to semantic image editing.

GAN Dissection: Visualizing and Understanding Generative Adversarial Networks

This work presents an analytic framework to visualize and understand GANs at the unit-, object-, and scene-level, and provides open source interpretation tools to help researchers and practitioners better understand their GAN models.

Network Dissection: Quantifying Interpretability of Deep Visual Representations

This work uses the proposed Network Dissection method to test the hypothesis that interpretability is an axis-independent property of the representation space, then applies the method to compare the latent representations of various networks when trained to solve different classification problems.
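The core measurement in network dissection is an intersection-over-union score between a unit's highly activating region and a labeled concept mask. A minimal numpy sketch of that scoring step — with an assumed quantile threshold, not the paper's exact calibration — might look like:

```python
import numpy as np

def dissection_iou(activation, concept_mask, quantile=0.995):
    """Network-dissection-style score: IoU between a unit's top-quantile
    activation region and a binary concept mask (simplified sketch).

    `quantile` is an illustrative threshold choice, not the paper's exact one.
    """
    thresh = np.quantile(activation, quantile)
    unit_mask = activation >= thresh
    inter = np.logical_and(unit_mask, concept_mask).sum()
    union = np.logical_or(unit_mask, concept_mask).sum()
    return inter / union if union else 0.0
```

A unit whose top activations consistently land on, say, "dog" pixels across a dataset gets a high IoU for the "dog" concept and is labeled accordingly.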

Optimism in the Face of Adversity: Understanding and Improving Deep Learning Through Adversarial Robustness

The goal of this article is to provide readers with a set of new perspectives to understand deep learning and supply them with intuitive tools and insights on how to use adversarial robustness to improve it.

TopKConv: Increased Adversarial Robustness Through Deeper Interpretability

  • Henry Eigen, Amir Sadovnik
  • Computer Science
    2021 20th IEEE International Conference on Machine Learning and Applications (ICMLA)
  • 2021
It is proposed that more interpretable networks should also be more robust, since they rely on features that are more understandable to humans; a sparsity-based defense is also proposed to counter the impact of overparameterization on adversarial vulnerability.

Transferred Discrepancy: Quantifying the Difference Between Representations

The transferred discrepancy (TD), a new metric that defines the difference between two representations based on their downstream-task performance, is introduced, and it is shown that under specific conditions the TD metric is closely related to previous metrics.
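
A downstream-performance-based comparison of this kind can be sketched by fitting a small probe on each representation and taking the gap in accuracy. Everything below (the probe, its hyperparameters, and the absolute-gap form) is an illustrative assumption rather than the paper's exact definition of TD:

```python
import numpy as np

def probe_accuracy(reps, labels, epochs=200, lr=0.5):
    """Fit a tiny logistic probe on `reps` and return its training accuracy.

    Hypothetical probe: plain gradient descent on logistic regression.
    """
    w = np.zeros(reps.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(reps @ w + b)))
        w -= lr * reps.T @ (p - labels) / len(labels)
        b -= lr * np.mean(p - labels)
    preds = (reps @ w + b > 0).astype(float)
    return np.mean(preds == labels)

def transferred_discrepancy(rep_a, rep_b, labels):
    """Sketch of a TD-style score: gap in downstream probe performance
    between two representations of the same inputs (illustrative form)."""
    return abs(probe_accuracy(rep_a, labels) - probe_accuracy(rep_b, labels))
```

Two representations that support the downstream task equally well get a score near zero, while a representation that discards task-relevant information shows a large gap.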