Perturbing Inputs for Fragile Interpretations in Deep Natural Language Processing

Sanchit Sinha, Hanjie Chen, Arshdeep Sekhon, Yangfeng Ji, Yanjun Qi
Interpretability methods such as Integrated Gradients and LIME are popular choices for explaining natural language model predictions with relative word-importance scores. These interpretations need to be robust for trustworthy NLP applications in high-stakes areas like medicine or finance. Our paper demonstrates how interpretations can be manipulated by making simple word perturbations on an input text. Via a small fraction of word-level swaps, these adversarial perturbations aim to make the…
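The manipulation the abstract describes can be illustrated with a minimal sketch (not the paper's method): a toy linear bag-of-words "sentiment model" with hypothetical weights, where a single synonym swap leaves the predicted label unchanged but reorders the word-importance ranking that a method like LIME would report.

```python
# Toy illustration only: hypothetical word weights, not a trained model.
# Per-word importance is just the word's weight, standing in for LIME/IG scores.
WEIGHTS = {"movie": 0.1, "excellent": 2.0, "superb": 0.5, "plot": 0.9}

def predict(tokens):
    """Positive sentiment iff the summed word weights exceed zero."""
    return sum(WEIGHTS.get(t, 0.0) for t in tokens) > 0

def importance(tokens):
    """Words ranked by importance score, highest first."""
    scores = {t: WEIGHTS.get(t, 0.0) for t in tokens}
    return sorted(scores, key=scores.get, reverse=True)

original = ["excellent", "plot", "movie"]   # top-ranked word: "excellent"
perturbed = ["superb", "plot", "movie"]     # one word-level swap

# The label is unchanged, but the most "important" word is now "plot".
assert predict(original) == predict(perturbed)
print(importance(original)[0], importance(perturbed)[0])
```

The point of the sketch: nothing about the prediction forces the explanation to be stable, so a semantically mild swap can move the spotlight to a different word entirely.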

Fooling Explanations in Text Classifiers

A novel explanation attack algorithm that alters text input samples imperceptibly so that the output of widely used explanation methods changes considerably while the classifier's predictions remain unchanged; the results show that all tested models and explanation methods are susceptible to the proposed TEF perturbations.

Logic Traps in Evaluating Attribution Scores

This paper systematically reviews existing methods for evaluating attribution scores, summarizes the logic traps in those methods, and suggests that the community stop focusing on improving performance under unreliable evaluation systems and instead work on reducing the impact of the identified logic traps.

Thermostat: A Large Collection of NLP Model Explanations and Analysis Tools

Thermostat is a large collection of model explanations and accompanying analysis tools that democratizes explainability research in the language domain, circumvents redundant computations and increases comparability and replicability.

Identification of Violence in Twitter Using a Custom Lexicon and NLP

  • Jonathan Adkins
  • Computer Science
    European Conference on Cyber Warfare and Security
  • 2022
A technique is presented in which the potential for increased violence within a community can be identified and measured using a combination of text mining, sentiment analysis, and graph theory; the authors assert that this approach will provide cybersecurity and homeland-security analysts with actionable threat intelligence.

Beware the Rationalization Trap! When Language Model Explainability Diverges from our Mental Models of Language

This position paper argues that, in order to avoid harmful rationalization and achieve truthful understanding of language models, explanation processes must satisfy three main conditions: truthfully represent the model's behavior, have a high reputation, and take the user's mental model into account.

Semantically Equivalent Adversarial Rules for Debugging NLP models

This work presents semantically equivalent adversaries (SEAs) – semantic-preserving perturbations that induce changes in the model's predictions – and generalizes them into semantically equivalent adversarial rules that induce adversaries on many instances that are extremely similar semantically.

Contextualized Perturbation for Textual Adversarial Attack

CLARE is a ContextuaLized AdversaRial Example generation model that produces fluent and grammatical outputs through a mask-then-infill procedure that can flexibly combine and apply perturbations at any position in the inputs, and is thus able to attack the victim model more effectively with fewer edits.
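The mask-then-infill procedure can be sketched in a few lines. This is a hedged toy, not CLARE itself: the candidate table and the fluency scorer below are hypothetical stand-ins for the masked language model that CLARE actually queries.

```python
# Hypothetical candidate table and fluency scorer, standing in for a
# masked language model. Not CLARE's implementation.
CANDIDATES = {"good": ["ok", "great"], "film": ["movie", "picture"]}

def fluency(tokens):
    """Stand-in fluency score; here it simply prefers shorter sentences."""
    return -sum(len(t) for t in tokens)

def mask_then_infill(tokens, position):
    """Mask one position, then fill it with the best-scoring candidate
    (keeping the original word if no candidate scores higher)."""
    best = tokens
    for cand in CANDIDATES.get(tokens[position], []):
        filled = tokens[:position] + [cand] + tokens[position + 1:]
        if fluency(filled) > fluency(best):
            best = filled
    return best

print(mask_then_infill(["a", "good", "film"], 1))  # swaps "good" for "ok"
print(mask_then_infill(["a", "good", "film"], 2))  # no candidate improves; unchanged
```

In the real model the infill distribution comes from the surrounding context, which is what lets the attack apply replace, insert, and merge perturbations at any position while staying fluent.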

Gradient-based Analysis of NLP Models is Manipulable

This paper merges the layers of a target model with a Facade model that overwhelms the gradients without affecting the predictions, and shows that the merged model effectively fools different gradient-based analysis tools.
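The core intuition can be shown with a one-dimensional toy (illustrative constants, not the paper's construction): add a term whose value is negligible everywhere but whose gradient dwarfs the target model's, so gradient-based saliency is dominated by the facade while predictions barely move.

```python
import math

# Illustrative constants: the facade term EPS * sin(K * x) has magnitude
# at most EPS = 1e-6, but its derivative reaches EPS * K = 100.
EPS, K = 1e-6, 1e8

def target(x):
    return 2.0 * x                             # the "real" model: gradient 2.0

def merged(x):
    return target(x) + EPS * math.sin(K * x)   # output shifts by at most 1e-6

def merged_grad(x):
    # Analytic gradient: the facade contributes EPS * K * cos(K * x),
    # up to 100 in magnitude, swamping the target's gradient of 2.0.
    return 2.0 + EPS * K * math.cos(K * x)

print(abs(merged(0.5) - target(0.5)))  # predictions nearly identical
print(merged_grad(0.0))                # gradient analysis sees ~102, not 2
```

A high-frequency, low-amplitude component is exactly the failure mode that makes raw input gradients untrustworthy as explanations.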

Generating Natural Language Adversarial Examples

A black-box population-based optimization algorithm is used to generate semantically and syntactically similar adversarial examples that fool well-trained sentiment analysis and textual entailment models with success rates of 97% and 70%, respectively.
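A population-based black-box search can be sketched as follows. This is a minimal toy in the spirit of a genetic attack, not the paper's algorithm: the synonym table and the "victim" scoring function are hypothetical, and the search just mutates candidates and keeps the strongest.

```python
import random

# Hypothetical synonym table and black-box victim score (toy stand-ins).
SYNONYMS = {"good": ["great", "fine"], "happy": ["glad", "content"]}

def black_box_score(tokens):
    """Stand-in victim confidence: counts the trigger words still present."""
    return sum(t in ("good", "happy") for t in tokens)

def mutate(tokens):
    """Swap one random position for a synonym, if one exists."""
    i = random.randrange(len(tokens))
    options = SYNONYMS.get(tokens[i])
    if not options:
        return tokens
    return tokens[:i] + [random.choice(options)] + tokens[i + 1:]

def attack(tokens, generations=20, pop_size=8):
    """Keep the pop_size candidates with the lowest victim score."""
    random.seed(0)  # deterministic for the sketch
    population = [tokens]
    for _ in range(generations):
        population += [mutate(random.choice(population)) for _ in range(pop_size)]
        population.sort(key=black_box_score)  # lower = stronger attack
        population = population[:pop_size]
    return population[0]

result = attack(["good", "happy", "day"])
print(result, black_box_score(result))
```

The real attack additionally constrains candidates by embedding distance and language-model fluency so the adversarial examples stay semantically and syntactically close to the original.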

Interpretation of Neural Networks is Fragile

This paper systematically characterizes the fragility of several widely used feature-importance interpretation methods (saliency maps, relevance propagation, and DeepLIFT) on ImageNet and CIFAR-10, and extends these results to show that interpretations based on exemplars (e.g., influence functions) are similarly fragile.

Visualizing and Understanding Neural Models in NLP

Four strategies for visualizing compositionality in neural NLP models, inspired by similar work in computer vision, are described, including LSTM-style gates that measure information flow and gradient back-propagation.

Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment

TextFooler is presented, a simple but strong baseline for generating adversarial text that outperforms previous attacks in success rate and perturbation rate; it is utility-preserving and efficient, generating adversarial text with computational complexity linear in the text length.

Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank

A Sentiment Treebank is introduced that includes fine-grained sentiment labels for 215,154 phrases in the parse trees of 11,855 sentences and presents new challenges for sentiment compositionality, along with the Recursive Neural Tensor Network.

Understanding Neural Networks through Representation Erasure

This paper proposes a general methodology to analyze and interpret decisions from a neural model by observing the effects on the model of erasing various parts of the representation, such as input word-vector dimensions, intermediate hidden units, or input words.
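The erasure idea reduces to a short loop: delete each part of the input in turn and measure how much the model's output moves. A minimal sketch, assuming a toy additive model with hypothetical weights rather than the paper's neural setup:

```python
# Hypothetical word weights for a toy additive "model" (illustration only).
WEIGHTS = {"terrible": -3.0, "acting": 0.2, "but": 0.0, "fun": 1.5}

def model(tokens):
    return sum(WEIGHTS.get(t, 0.0) for t in tokens)

def erasure_importance(tokens):
    """Score each word by the output change caused by erasing it."""
    base = model(tokens)
    return {
        t: abs(base - model(tokens[:i] + tokens[i + 1:]))
        for i, t in enumerate(tokens)
    }

scores = erasure_importance(["terrible", "acting", "but", "fun"])
print(max(scores, key=scores.get))  # the word whose removal moves the output most
```

The same loop applies at other granularities (embedding dimensions, hidden units); the cost is one extra forward pass per erased component.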

Generating Natural Adversarial Examples

This paper proposes a framework for generating natural and legible adversarial examples that lie on the data manifold, by searching in the semantic space of dense and continuous data representations, utilizing recent advances in generative adversarial networks.