HotFlip: White-Box Adversarial Examples for Text Classification
@inproceedings{Ebrahimi2017HotFlipWA, title={HotFlip: White-Box Adversarial Examples for Text Classification}, author={J. Ebrahimi and Anyi Rao and Daniel Lowd and Dejing Dou}, booktitle={Annual Meeting of the Association for Computational Linguistics}, year={2017} }
We propose an efficient method to generate white-box adversarial examples to trick a character-level neural classifier. We find that only a few manipulations are needed to greatly decrease the accuracy. Our method relies on an atomic flip operation, which swaps one token for another, based on the gradients of the one-hot input vectors. Due to the efficiency of our method, we can perform adversarial training, which makes the model more robust to attacks at test time. With the use of a few semantics…
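In essence, every candidate substitution can be scored from a single backward pass: the first-order estimate of the loss change from flipping position i from character a to character b is the gradient at (i, b) minus the gradient at (i, a). The snippet below is a minimal PyTorch sketch of that scoring step, assuming the gradient of the loss with respect to the one-hot character inputs has already been computed; the function name and tensor shapes are illustrative, not the authors' released code.

```python
import torch

def best_hotflip(onehot_grad: torch.Tensor, onehot_input: torch.Tensor):
    """Illustrative sketch (not the authors' implementation) of the flip score.

    onehot_grad  : (seq_len, vocab) gradient of the loss w.r.t. the one-hot inputs
    onehot_input : (seq_len, vocab) one-hot encoding of the current characters
    """
    # Gradient at the character currently occupying each position.
    current = (onehot_grad * onehot_input).sum(dim=1, keepdim=True)
    # First-order estimate of the loss increase for every candidate flip:
    # grad[i, b] - grad[i, a] for replacing character a at position i with b.
    gain = onehot_grad - current
    gain = gain.masked_fill(onehot_input.bool(), float("-inf"))  # exclude self-flips
    flat = torch.argmax(gain)
    pos, new_char = divmod(int(flat), gain.size(1))
    return pos, new_char, float(gain[pos, new_char])
```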
625 Citations
Generating Black-Box Adversarial Examples for Text Classifiers Using a Deep Reinforced Model
- Computer Science, ECML/PKDD
- 2019
A reinforcement-learning-based approach to generating adversarial examples in black-box settings that fools well-trained models on the IMDB sentiment classification task and the AG's News categorization task with high success rates.
A Differentiable Language Model Adversarial Attack on Text Classifiers
- Computer Science, IEEE Access
- 2022
A new black-box sentence-level attack that fine-tunes a pre-trained language model to generate adversarial examples and outperforms competitors on a diverse set of NLP problems in both computed metrics and human evaluation.
Detecting Word-Level Adversarial Text Attacks via SHapley Additive exPlanations
- Computer Science, REPL4NLP
- 2022
This work adapts a technique from computer vision to detect word-level attacks targeting text classifiers using Shapley additive explanations, and shows that the detector requires only a small number of training samples and generalizes to different datasets without retraining.
BAE: BERT-based Adversarial Examples for Text Classification
- Computer Science, EMNLP
- 2020
This work presents BAE, a powerful black-box attack for generating grammatically correct and semantically coherent adversarial examples, and shows that BAE mounts a stronger attack on three widely used models across seven text classification datasets.
STRATA: Simple, Gradient-Free Attacks for Models of Code
- Computer Science
- 2020
This work identifies a striking relationship between token frequency statistics and learned token embeddings: the L2 norm of learned token embeddings increases with the frequency of the token, except for the highest-frequency tokens.
Block-Sparse Adversarial Attack to Fool Transformer-Based Text Classifiers
- Computer Science, ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
- 2022
Experimental results demonstrate that, while the adversarial attack maintains the semantics of the sentence, it can reduce the accuracy of GPT-2 to less than 5% on different datasets (AG News, MNLI, and Yelp Reviews).
On Adversarial Examples for Character-Level Neural Machine Translation
- Computer Science, COLING
- 2018
This work investigates adversarial examples for character-level neural machine translation (NMT) and proposes two novel types of attacks that aim to remove or change a word in a translation, rather than simply breaking the NMT model.
Universal Adversarial Perturbation for Text Classification
- Computer Science, ArXiv
- 2019
This work proposes an algorithm to compute universal adversarial perturbations and shows that state-of-the-art deep neural networks are highly vulnerable to them, even though the perturbations mostly preserve the neighborhood of each token.
CRank: Reusable Word Importance Ranking for Text Adversarial Attack
- Computer Science, Applied Sciences
- 2021
This paper proposes CRank, a black-box method built on a novel masking and ranking strategy, which improves efficiency by 75% at the 'cost' of only a 1% drop in success rate compared to the classic method.
A Simple Yet Efficient Method for Adversarial Word-Substitute Attack
- Computer Science, ArXiv
- 2022
This research highlights that an adversary can fool a deep NLP model at much lower cost while maintaining attack effectiveness.
References
SHOWING 1-10 OF 25 REFERENCES
Adversarial Training Methods for Semi-Supervised Text Classification
- Computer Science, ICLR
- 2017
This work extends adversarial and virtual adversarial training to the text domain by applying perturbations to the word embeddings in a recurrent neural network rather than to the original input itself.
Generating Natural Adversarial Examples
- Computer Science, ICLR
- 2018
This paper proposes a framework to generate natural and legible adversarial examples that lie on the data manifold by searching in the semantic space of a dense and continuous data representation, utilizing recent advances in generative adversarial networks.
Explaining and Harnessing Adversarial Examples
- Computer Science, ICLR
- 2015
It is argued that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature, supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets.
Learning Robust Representations of Text
- Computer Science, EMNLP
- 2016
Empirical evaluation over a range of sentiment datasets with a convolutional neural network shows that the regularization-based method achieves superior performance on noisy inputs and out-of-domain data.
The Limitations of Deep Learning in Adversarial Settings
- Computer Science, 2016 IEEE European Symposium on Security and Privacy (EuroS&P)
- 2016
This work formalizes the space of adversaries against deep neural networks (DNNs) and introduces a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs.
Towards Evaluating the Robustness of Neural Networks
- Computer Science, 2017 IEEE Symposium on Security and Privacy (SP)
- 2017
It is demonstrated that defensive distillation does not significantly increase the robustness of neural networks, and three new attack algorithms are introduced that succeed on both distilled and undistilled neural networks with 100% probability.
Towards Deep Learning Models Resistant to Adversarial Attacks
- Computer Science, ICLR
- 2018
This work studies the adversarial robustness of neural networks through the lens of robust optimization, and suggests the notion of security against a first-order adversary as a natural and broad security guarantee.
Crafting adversarial input sequences for recurrent neural networks
- Computer Science, MILCOM 2016 - 2016 IEEE Military Communications Conference
- 2016
This paper investigates adversarial input sequences for recurrent neural networks processing sequential data and shows that the classes of algorithms previously introduced to craft adversarial samples misclassified by feed-forward neural networks can be adapted to recurrent neural networks.
Adversarial Example Generation with Syntactically Controlled Paraphrase Networks
- Computer Science, NAACL
- 2018
A combination of automated and human evaluations show that SCPNs generate paraphrases that follow their target specifications without decreasing paraphrase quality when compared to baseline (uncontrolled) paraphrase systems.
Adversarial learning
- Computer Science, KDD '05
- 2005
This paper introduces the adversarial classifier reverse engineering (ACRE) learning problem, the task of learning sufficient information about a classifier to construct adversarial attacks, and presents efficient algorithms for reverse engineering linear classifiers with either continuous or Boolean features.