On the Lack of Robust Interpretability of Neural Text Classifiers

@article{Zafar2021OnTL,
  title={On the Lack of Robust Interpretability of Neural Text Classifiers},
  author={Muhammad Bilal Zafar and Michele Donini and Dylan Slack and C. Archambeau and Sanjiv Ranjan Das and Krishnaram Kenthapadi},
  journal={ArXiv},
  year={2021},
  volume={abs/2106.04631}
}
With the ever-increasing complexity of neural language models, practitioners have turned to methods for understanding the predictions of these models. One of the most well-adopted approaches for model interpretability is feature-based interpretability, i.e., ranking the features in terms of their impact on model predictions. Several prior studies have focused on assessing the fidelity of feature-based interpretability methods, i.e., measuring the impact of dropping the top-ranked features on the…
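As a rough illustration of the deletion-based fidelity protocol referred to in the abstract, the sketch below ranks tokens with a stand-in attribution (linear coefficient × tf-idf), drops the top-ranked tokens, and measures the change in predicted probability; the toy corpus and linear classifier are assumptions for illustration, not the paper's setup.

```python
# Minimal sketch of a deletion-based fidelity check: rank tokens by an
# attribution score, drop the top-k, and observe how much the predicted
# probability moves. A linear model stands in for a neural classifier.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

texts = ["a gripping and moving film", "dull plot and wooden acting",
         "sharp writing with great pacing", "tedious and badly edited"]
labels = [1, 0, 1, 0]

vec = TfidfVectorizer()
X = vec.fit_transform(texts)
clf = LogisticRegression().fit(X, labels)
vocab = vec.get_feature_names_out()

def drop_top_k(text, k=2):
    """Remove the k tokens with the largest |coef * tf-idf| attribution."""
    scores = vec.transform([text]).toarray()[0] * clf.coef_[0]
    top = set(vocab[np.argsort(-np.abs(scores))[:k]])
    return " ".join(t for t in text.split() if t not in top)

for t in texts:
    p_full = clf.predict_proba(vec.transform([t]))[0, 1]
    p_drop = clf.predict_proba(vec.transform([drop_top_k(t)]))[0, 1]
    print(f"{t!r}: p(pos) {p_full:.2f} -> {p_drop:.2f}")
```

A large shift after removing the top-ranked tokens is usually read as evidence that the attribution is faithful to the model.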

More Than Words: Towards Better Quality Interpretations of Text Classifiers

TLDR
Higher-level feature attributions offer several advantages and are more intelligible to humans in situations where the linguistic coherence resides at a higher granularity level; token-based interpretability, while a convenient first choice given the input interfaces of ML models, is not the most effective one in all situations.

Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing

TLDR
This paper demonstrates how interpretations can be manipulated through simple word perturbations of an input text, attacking two SOTA interpretation methods across three popular Transformer models and three different NLP datasets.
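One way to quantify this kind of fragility is to compare token-importance rankings before and after a small perturbation using rank correlation; in the sketch below the attribution vectors are illustrative placeholders, not outputs of the attacked methods.

```python
# Compare explanations of an input and a lightly perturbed input via
# Kendall's tau over token-importance scores; a low tau indicates that
# the explanation, not necessarily the prediction, is fragile.
from scipy.stats import kendalltau

tokens         = ["the", "film", "was", "truly", "great"]
attr_original  = [0.02, 0.35, 0.05, 0.10, 0.80]   # placeholder scores, original input
attr_perturbed = [0.03, 0.10, 0.04, 0.70, 0.45]   # placeholder scores, after a synonym swap

tau, _ = kendalltau(attr_original, attr_perturbed)
print(f"rank correlation between explanations: {tau:.2f}")
```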

Thermostat: A Large Collection of NLP Model Explanations and Analysis Tools

TLDR
Thermostat is a large collection of model explanations and accompanying analysis tools that democratizes explainability research in the language domain, circumvents redundant computations and increases comparability and replicability.

TalkToModel: Understanding Machine Learning Models With Open Ended Dialogues

TLDR
TalkToModel is introduced: an open-ended dialogue system for understanding machine learning models that understands user inputs on novel datasets and models with high accuracy and represents a new category of model-understanding tools for practitioners.

TalkToModel: Explaining Machine Learning Models with Interactive Natural Language Conversations

TLDR
TalkToModel is introduced: an interactive dialogue system that explains machine learning models through conversations, understands user inputs on novel datasets and models with high accuracy, and represents a new category of explainability tools for practitioners.

Robustness Analysis of Grover for Machine-Generated News Detection

TLDR
An investigation of Grover’s susceptibility to adversarial attacks such as character-level and word-level perturbations shows that even a single character alteration can cause Grover to fail, exposing a lack of robustness.

Measuring Representational Robustness of Neural Networks Through Shared Invariances

TLDR
This work offers a new view on robustness by using another reference NN to define the set of perturbations a given NN should be invariant to, thus generalizing the reliance on a reference “human NN” to any NN.

Rethinking Explainability as a Dialogue: A Practitioner's Perspective

TLDR
A set of five principles that researchers should follow when designing interactive explanations is outlined as a starting point for future work, and it is shown why natural language dialogues satisfy these principles and are a desirable way to build interactive explanations.

References

Showing 1-10 of 61 references

A Diagnostic Study of Explainability Techniques for Text Classification

TLDR
A comprehensive list of diagnostic properties for evaluating existing explainability techniques is developed; gradient-based explanations are found to perform best across tasks and model architectures, and further insights into the properties are presented.

Pathologies of Neural Models Make Interpretations Difficult

TLDR
This work uses input reduction, which iteratively removes the least important word from the input, to expose pathological behaviors of neural models: the remaining words appear nonsensical to humans and are not the ones determined as important by interpretation methods.
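The reduction loop itself is short; a minimal sketch follows, in which `predict_label` and `token_importance` are assumed callables standing in for the neural model's prediction and an importance measure such as a gradient-based score.

```python
# Input reduction: repeatedly delete the least important token for as long
# as the model's predicted label stays the same, then return the (often
# nonsensical) residue that still yields the original prediction.
from typing import Callable, List

def input_reduction(tokens: List[str],
                    predict_label: Callable[[List[str]], int],
                    token_importance: Callable[[List[str]], List[float]]) -> List[str]:
    original = predict_label(tokens)
    while len(tokens) > 1:
        scores = token_importance(tokens)
        idx = min(range(len(tokens)), key=lambda i: scores[i])   # least important token
        candidate = tokens[:idx] + tokens[idx + 1:]
        if predict_label(candidate) != original:
            break                                                # keep last consistent input
        tokens = candidate
    return tokens
```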

Comparing Automatic and Human Evaluation of Local Explanations for Text Classification

TLDR
A variety of local explanation approaches are evaluated using automatic measures based on word deletion, showing that an evaluation using a crowdsourcing experiment correlates moderately with these automatic measures and that a variety of other factors also impact the human judgements.

Visualizing and Understanding Neural Models in NLP

TLDR
Four strategies for visualizing compositionality in neural models for NLP, inspired by similar work in computer vision and including LSTM-style gates that measure information flow and gradient back-propagation, are described.
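As a minimal sketch of the gradient back-propagation strategy mentioned above, the snippet below computes gradient × input saliency for a toy embedding-plus-linear classifier in PyTorch; the vocabulary, model, and sizes are assumptions, not the paper's setup.

```python
# Gradient x input saliency: back-propagate the predicted-class logit to
# the token embeddings and aggregate per token.
import torch
import torch.nn as nn

vocab = {"the": 0, "movie": 1, "was": 2, "great": 3, "boring": 4}
emb = nn.Embedding(len(vocab), 8)
clf = nn.Linear(8, 2)

tokens = ["the", "movie", "was", "great"]
ids = torch.tensor([[vocab[t] for t in tokens]])

vectors = emb(ids)                       # (1, seq_len, 8)
vectors.retain_grad()                    # keep gradients on a non-leaf tensor
logits = clf(vectors.mean(dim=1))        # mean-pool then classify
logits[0, logits.argmax()].backward()    # gradient of the predicted class

saliency = (vectors.grad * vectors).sum(dim=-1).squeeze(0)   # gradient x input
for tok, score in zip(tokens, saliency.tolist()):
    print(f"{tok:>8s}  {score:+.4f}")
```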

Towards Robust Interpretability with Self-Explaining Neural Networks

TLDR
This work designs self-explaining models in stages, progressively generalizing linear classifiers to complex yet architecturally explicit models, and proposes three desiderata for explanations in general – explicitness, faithfulness, and stability.

A Unified Approach to Interpreting Model Predictions

TLDR
A unified framework for interpreting predictions, SHAP (SHapley Additive exPlanations), which unifies six existing methods and presents new methods that show improved computational performance and/or better consistency with human intuition than previous approaches.
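The quantity SHAP approximates is the Shapley value of each feature; a brief Monte-Carlo sketch of that quantity for token-level attribution is shown below, with a toy additive value function standing in for a real model score.

```python
# Permutation-sampling estimate of Shapley values: average each token's
# marginal contribution over random orderings of the tokens.
import random

def shapley_estimate(tokens, value_fn, n_samples=200, seed=0):
    rng = random.Random(seed)
    phi = [0.0] * len(tokens)
    for _ in range(n_samples):
        order = list(range(len(tokens)))
        rng.shuffle(order)
        present = set()
        prev = value_fn(present, tokens)
        for i in order:
            present.add(i)
            cur = value_fn(present, tokens)
            phi[i] += (cur - prev) / n_samples   # marginal contribution of token i
            prev = cur
    return phi

# Toy value function (assumption): +1 if "great" is present, -1 if "boring" is.
def toy_value(present, tokens):
    return sum({"great": 1.0, "boring": -1.0}.get(tokens[i], 0.0) for i in present)

print(shapley_estimate(["the", "movie", "was", "great"], toy_value))
```

For this additive toy value function the estimate concentrates all attribution on "great", matching the exact Shapley solution.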

ERASER: A Benchmark to Evaluate Rationalized NLP Models

TLDR
This work proposes the Evaluating Rationales And Simple English Reasoning (ERASER) benchmark to advance research on interpretable models in NLP, along with several metrics that aim to capture how well the rationales provided by models align with human rationales and how faithful these rationales are.
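One agreement metric of the kind ERASER reports is token-level F1 between a model rationale and a human rationale; a minimal sketch of that computation, treating rationales as sets of token positions, follows.

```python
# Token-level F1 between a model-selected rationale and a human rationale,
# both represented as sets of token positions. Illustrative only; ERASER
# defines additional agreement and faithfulness metrics.
def rationale_f1(predicted: set, human: set) -> float:
    if not predicted or not human:
        return 0.0
    tp = len(predicted & human)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(human)
    return 2 * precision * recall / (precision + recall)

print(rationale_f1({0, 3, 4}, {3, 4, 7}))   # -> 0.666...
```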

AllenNLP Interpret: A Framework for Explaining Predictions of NLP Models

TLDR
This work introduces AllenNLP Interpret, a flexible framework for interpreting NLP models, which provides interpretation primitives for any AllenNLP model and task, a suite of built-in interpretation methods, and a library of front-end visualization components.

Explaining Explanations: An Overview of Interpretability of Machine Learning

There has recently been a surge of work in explanatory artificial intelligence (XAI). This research area tackles the important problem that complex machines and algorithms often cannot provide…

"Why Should I Trust You?": Explaining the Predictions of Any Classifier

TLDR
LIME is proposed, a novel explanation technique that explains the predictions of any classifier in an interpretable and faithful manner, by learning an interpretable model locally around the prediction.
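A minimal sketch of the LIME idea, assuming a toy black-box scorer: perturb the input by dropping random words, weight the perturbed samples by proximity to the original, and read the explanation off a weighted linear surrogate.

```python
# LIME-style local surrogate for a text prediction. The black-box
# `predict_proba` is a toy stand-in; in practice it would be the classifier
# being explained.
import math
import numpy as np
from sklearn.linear_model import Ridge

def predict_proba(text: str) -> float:              # toy black box (assumption)
    score = text.count("great") - text.count("boring")
    return 1.0 / (1.0 + math.exp(-score))

def lime_explain(text, n_samples=500, seed=0):
    rng = np.random.default_rng(seed)
    tokens = text.split()
    masks = rng.integers(0, 2, size=(n_samples, len(tokens)))
    masks[0] = 1                                     # keep the original instance
    preds, weights = [], []
    for m in masks:
        kept = " ".join(t for t, keep in zip(tokens, m) if keep)
        preds.append(predict_proba(kept))
        dist = 1.0 - m.mean()                        # fraction of words removed
        weights.append(math.exp(-(dist ** 2) / 0.25))  # proximity kernel
    surrogate = Ridge(alpha=1.0).fit(masks, preds, sample_weight=weights)
    return {tok: float(coef) for tok, coef in zip(tokens, surrogate.coef_)}

print(lime_explain("the movie was great not boring"))
```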