HotFlip: White-Box Adversarial Examples for Text Classification

@inproceedings{ebrahimi2018hotflip,
  title={HotFlip: White-Box Adversarial Examples for Text Classification},
  author={J. Ebrahimi and Anyi Rao and Daniel Lowd and Dejing Dou},
  booktitle={Annual Meeting of the Association for Computational Linguistics},
  year={2018}
}
We propose an efficient method to generate white-box adversarial examples that trick a character-level neural classifier. We find that only a few manipulations are needed to greatly decrease accuracy. Our method relies on an atomic flip operation, which swaps one token for another based on the gradients of the one-hot input vectors. Due to the efficiency of our method, we can perform adversarial training, which makes the model more robust to attacks at test time. With the use of a few semantics…
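The flip operation described in the abstract can be sketched as follows. Assuming the gradient of the loss with respect to the one-hot character inputs is available as a NumPy array, the first-order estimate of the loss increase for flipping position i from character a to character b is grad[i, b] - grad[i, a]; function and variable names here are illustrative, not the paper's exact implementation:

```python
import numpy as np

def best_flip(x_onehot, grad):
    """Pick the single character flip with the largest first-order
    increase in loss: gain[i, b] = grad[i, b] - grad[i, a], where a is
    the current character at position i."""
    seq_len, vocab = x_onehot.shape
    current = x_onehot.argmax(axis=1)            # current char id at each position
    # estimated loss change for every (position, replacement char) pair
    gain = grad - grad[np.arange(seq_len), current][:, None]
    gain[np.arange(seq_len), current] = -np.inf  # disallow no-op flips
    i, b = np.unravel_index(gain.argmax(), gain.shape)
    return i, current[i], b                      # position, old char, new char
```

Applying the single best flip, or beam-searching over a few such flips, is what makes the attack cheap enough to fold back into adversarial training.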


Generating Black-Box Adversarial Examples for Text Classifiers Using a Deep Reinforced Model

A reinforcement-learning-based approach to generating adversarial examples in black-box settings that fools well-trained models on the IMDB sentiment classification task and the AG's News categorization task with high success rates.

A Differentiable Language Model Adversarial Attack on Text Classifiers

A new black-box sentence-level attack that fine-tunes a pre-trained language model to generate adversarial examples, outperforming competitors on a diverse set of NLP problems in both computed metrics and human evaluation.

Detecting Word-Level Adversarial Text Attacks via SHapley Additive exPlanations

This work adapts a technique from computer vision to detect word-level attacks targeting text classifiers using Shapley additive explanations, and shows that the detector requires only a small number of training samples and generalizes to different datasets without retraining.

BAE: BERT-based Adversarial Examples for Text Classification

This work presents BAE, a powerful black-box attack for generating grammatically correct and semantically coherent adversarial examples, and shows that BAE performs a stronger attack on three widely used models across seven text classification datasets.

STRATA: Simple, Gradient-Free Attacks for Models of Code

This work identifies a striking relationship between token frequency statistics and learned token embeddings: the L2 norm of learned token embeddings increases with the frequency of the token, except for the highest-frequency tokens.

Block-Sparse Adversarial Attack to Fool Transformer-Based Text Classifiers

Experimental results demonstrate that, while the adversarial attack maintains the semantics of the sentence, it can reduce the accuracy of GPT-2 to less than 5% on different datasets (AG News, MNLI, and Yelp Reviews).

On Adversarial Examples for Character-Level Neural Machine Translation

This work investigates adversarial examples for character-level neural machine translation (NMT), and proposes two novel types of attacks which aim to remove or change a word in a translation, rather than simply break the NMT model.

Universal Adversarial Perturbation for Text Classification

This work proposes an algorithm to compute universal adversarial perturbations, and shows that the state-of-the-art deep neural networks are highly vulnerable to them, even though they keep the neighborhood of tokens mostly preserved.

CRank: Reusable Word Importance Ranking for Text Adversarial Attack

This paper proposes CRank, a black-box method built on a novel masking and ranking strategy, which improves efficiency by 75% at the cost of only a 1% drop in success rate compared to the classic method.

A Simple Yet Efficient Method for Adversarial Word-Substitute Attack

This research highlights that an adversary can fool a deep NLP model at much lower cost while maintaining attack effectiveness.

Adversarial Training Methods for Semi-Supervised Text Classification

This work extends adversarial and virtual adversarial training to the text domain by applying perturbations to the word embeddings in a recurrent neural network rather than to the original input itself.
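The embedding-space perturbation described above can be sketched as an L2-normalized step along the loss gradient. This is a minimal illustration; the function name and epsilon value are assumptions, not the paper's exact formulation:

```python
import numpy as np

def adv_perturb_embeddings(embeds, grad, epsilon=0.02):
    """Add an adversarial perturbation to word embeddings: a step of
    L2 length epsilon in the direction of the loss gradient."""
    norm = np.linalg.norm(grad)
    if norm == 0.0:
        return embeds
    return embeds + epsilon * grad / norm
```

Perturbing embeddings rather than discrete tokens keeps the perturbation differentiable, which is what makes this form of adversarial training tractable for text.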

Generating Natural Adversarial Examples

This paper proposes a framework to generate natural and legible adversarial examples that lie on the data manifold, by searching in semantic space of dense and continuous data representation, utilizing the recent advances in generative adversarial networks.

Explaining and Harnessing Adversarial Examples

It is argued that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature, supported by new quantitative results, and a first explanation is given for the most intriguing fact about adversarial examples: their generalization across architectures and training sets.
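The linearity argument motivates that paper's fast gradient sign method, a one-step perturbation along the sign of the loss gradient. A minimal sketch (the clipping range is an illustrative assumption for inputs normalized to [0, 1]):

```python
import numpy as np

def fgsm(x, grad, epsilon=0.1):
    """Fast gradient sign method: perturb each input component by
    epsilon in the direction of the sign of the loss gradient."""
    return np.clip(x + epsilon * np.sign(grad), 0.0, 1.0)
```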

Learning Robust Representations of Text

Empirical evaluation over a range of sentiment datasets with a convolutional neural network shows that the regularization based method achieves superior performance over noisy inputs and out-of-domain data.

The Limitations of Deep Learning in Adversarial Settings

This work formalizes the space of adversaries against deep neural networks (DNNs) and introduces a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs.

Towards Evaluating the Robustness of Neural Networks

It is demonstrated that defensive distillation does not significantly increase the robustness of neural networks, and three new attack algorithms are introduced that succeed on both distilled and undistilled neural networks with 100% probability.

Towards Deep Learning Models Resistant to Adversarial Attacks

This work studies the adversarial robustness of neural networks through the lens of robust optimization, and suggests the notion of security against a first-order adversary as a natural and broad security guarantee.

Crafting adversarial input sequences for recurrent neural networks

This paper investigates adversarial input sequences for recurrent neural networks processing sequential data, and shows that the classes of algorithms previously introduced to craft adversarial samples misclassified by feed-forward neural networks can be adapted to recurrent neural networks.

Adversarial Example Generation with Syntactically Controlled Paraphrase Networks

A combination of automated and human evaluations show that SCPNs generate paraphrases that follow their target specifications without decreasing paraphrase quality when compared to baseline (uncontrolled) paraphrase systems.

Adversarial learning

This paper introduces the adversarial classifier reverse engineering (ACRE) learning problem, the task of learning sufficient information about a classifier to construct adversarial attacks, and presents efficient algorithms for reverse engineering linear classifiers with either continuous or Boolean features.