Corpus ID: 28190697

HotFlip: White-Box Adversarial Examples for NLP

@article{Ebrahimi2017HotFlipWA,
  title={HotFlip: White-Box Adversarial Examples for NLP},
  author={J. Ebrahimi and Anyi Rao and Daniel Lowd and Dejing Dou},
  journal={ArXiv},
  year={2017},
  volume={abs/1712.06751}
}
Adversarial examples expose vulnerabilities of machine learning models. […] Our method, HotFlip, relies on an atomic flip operation, which swaps one token for another, based on the gradients of the one-hot input vectors. In experiments on text classification and machine translation, we find that only a few manipulations are needed to greatly increase the error rates. We analyze the properties of these examples, and show that employing these adversarial examples in training can improve test-time…
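The abstract describes the flip operation only at a high level. The following is a minimal sketch of the first-order estimate behind such a gradient-based flip, assuming a PyTorch character-level classifier over one-hot inputs; it is an illustration, not the authors' released code, and the names model, loss_fn, onehot, and label are hypothetical placeholders.

import torch

def best_flip(model, loss_fn, onehot, label):
    """Estimate the single character flip that most increases the loss.

    Sketch only: under a first-order approximation, replacing the character
    at position i with character b changes the loss by roughly
    grad[i, b] - grad[i, a], where grad is the gradient of the loss with
    respect to the one-hot inputs and a is the current character.
    """
    onehot = onehot.clone().detach().requires_grad_(True)    # (seq_len, vocab_size)
    loss = loss_fn(model(onehot.unsqueeze(0)), label)        # add batch dimension
    loss.backward()
    grad = onehot.grad
    with torch.no_grad():
        current = (grad * onehot).sum(dim=1, keepdim=True)   # gradient at the current characters
        gain = grad - current                                # estimated loss increase per flip
        gain[onehot.bool()] = float("-inf")                  # never "flip" to the same character
        pos = int(gain.max(dim=1).values.argmax())           # best position to edit
        new_char = int(gain[pos].argmax())                   # best replacement character id
    return pos, new_char

Applying the returned flip to the discrete text and re-querying the model gives a greedy single-flip attack; a sequence of flips can be chosen greedily or with beam search using the same estimate.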

Citations

Generating Black-Box Adversarial Examples for Text Classifiers Using a Deep Reinforced Model
TLDR: A reinforcement-learning-based approach to generating adversarial examples in black-box settings that is able to fool well-trained models on the IMDB sentiment classification task and the AG's News categorization task with high success rates.
Elephant in the Room: An Evaluation Framework for Assessing Adversarial Examples in NLP
TLDR: An evaluation framework consisting of a set of automatic evaluation metrics and human evaluation guidelines is proposed to rigorously assess the quality of adversarial examples along the properties the paper identifies.
From Hero to Zéroe: A Benchmark of Low-Level Adversarial Attacks
TLDR: This work proposes the first large-scale catalogue and benchmark of low-level adversarial attacks, dubbed Zéroe, encompassing nine different attack modes including visual and phonetic adversaries, and shows that RoBERTa, NLP's current workhorse, fails on these attacks.
Poison Attacks against Text Datasets with Conditional Adversarially Regularized Autoencoder
TLDR: Results show that a victim BERT fine-tuned classifier's predictions can be steered to the poison target class with success rates above 80% when the input hypothesis is injected with the poison signature, demonstrating that NLI and text classification systems face a serious security risk.
Models in the Wild: On Corruption Robustness of Neural NLP Systems
TLDR: This paper introduces WildNLP, a framework for testing model stability in a natural setting where text corruptions such as keyboard errors or misspellings occur, and compares the robustness of deep learning models from four popular NLP tasks by testing their performance on the aspects introduced in the framework.
TextDecepter: Hard Label Black Box Attack on Text Classifiers
TLDR: This paper presents a novel approach for hard-label black-box attacks against Natural Language Processing (NLP) classifiers, where no model information is disclosed and an attacker can only query the model for its final decision, without confidence scores for the classes involved.
Reevaluating Adversarial Examples in Natural Language
TLDR: This work distills ideas from past work into a unified framework, in which a successful natural language adversarial example is a perturbation that fools the model and follows some linguistic constraints, and analyzes the outputs of two state-of-the-art synonym substitution attacks.
Toward Mitigating Adversarial Texts
TLDR: This paper proposes a defense against black-box adversarial attacks using a spell-checking system that utilizes frequency and contextual information to correct non-word misspellings and outperforms six publicly available, state-of-the-art spelling correction tools.
TextBugger: Generating Adversarial Text Against Real-world Applications
TLDR: This paper presents TextBugger, a general attack framework for generating adversarial texts, and empirically evaluates its effectiveness, evasiveness, and efficiency on a set of real-world DLTU systems and services used for sentiment analysis and toxic content detection.
...

References

Showing 1-10 of 27 references
Explaining and Harnessing Adversarial Examples
TLDR: It is argued that the primary cause of neural networks' vulnerability to adversarial perturbations is their linear nature; this claim is supported by new quantitative results and gives the first explanation of the most intriguing fact about adversarial examples: their generalization across architectures and training sets.
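The linearity argument in the entry above is commonly illustrated with the fast gradient sign method (FGSM) from the same paper. The sketch below is a standard formulation rather than code from the cited work; model, loss_fn, x, and y are hypothetical placeholders, and x is assumed to be a continuous input tensor.

import torch

def fgsm(model, loss_fn, x, y, epsilon=0.01):
    # Under a locally linear model, the loss increases fastest along sign(grad).
    x = x.clone().detach().requires_grad_(True)
    loss = loss_fn(model(x), y)
    loss.backward()
    return (x + epsilon * x.grad.sign()).detach()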
Adversarial Training Methods for Semi-Supervised Text Classification
TLDR: This work extends adversarial and virtual adversarial training to the text domain by applying perturbations to the word embeddings in a recurrent neural network rather than to the original input itself.
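As a hedged sketch of that idea (perturbing continuous embeddings instead of discrete tokens), the function below computes an adversarial loss term in embedding space; embed, encoder, loss_fn, tokens, and y are hypothetical placeholders, and the normalization is simplified.

import torch

def embedding_adversarial_loss(embed, encoder, loss_fn, tokens, y, epsilon=1.0):
    # Perturb the word embeddings, not the discrete input tokens.
    emb = embed(tokens).detach().requires_grad_(True)
    loss = loss_fn(encoder(emb), y)
    grad = torch.autograd.grad(loss, emb)[0]
    delta = epsilon * grad / (grad.norm() + 1e-12)   # L2-normalized perturbation
    return loss_fn(encoder(emb + delta), y)          # extra loss term for adversarial training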
Generating Natural Adversarial Examples
TLDR: This paper proposes a framework for generating natural and legible adversarial examples that lie on the data manifold by searching in the semantic space of a dense, continuous data representation, utilizing recent advances in generative adversarial networks.
Adversarial Example Generation with Syntactically Controlled Paraphrase Networks
TLDR: A combination of automated and human evaluations shows that SCPNs generate paraphrases that follow their target specifications without decreasing paraphrase quality compared to baseline (uncontrolled) paraphrase systems.
The Limitations of Deep Learning in Adversarial Settings
TLDR: This work formalizes the space of adversaries against deep neural networks (DNNs) and introduces a novel class of algorithms to craft adversarial samples based on a precise understanding of the mapping between inputs and outputs of DNNs.
Towards Deep Learning Models Resistant to Adversarial Attacks
TLDR: This work studies the adversarial robustness of neural networks through the lens of robust optimization and suggests the notion of security against a first-order adversary as a natural and broad security guarantee.
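The robust-optimization view in the entry above is typically instantiated with projected gradient descent (PGD) as the inner maximization. The following is a standard L-infinity PGD sketch, not the cited authors' code; model, loss_fn, x, and y are hypothetical placeholders.

import torch

def pgd_attack(model, loss_fn, x, y, epsilon=0.03, alpha=0.01, steps=10):
    x_adv = x.clone().detach()
    for _ in range(steps):
        x_adv.requires_grad_(True)
        loss = loss_fn(model(x_adv), y)
        grad = torch.autograd.grad(loss, x_adv)[0]
        # Ascend the loss, then project back into the epsilon ball around x.
        x_adv = x_adv.detach() + alpha * grad.sign()
        x_adv = torch.min(torch.max(x_adv, x - epsilon), x + epsilon)
    return x_adv.detach()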
Towards Evaluating the Robustness of Neural Networks
TLDR: It is demonstrated that defensive distillation does not significantly increase the robustness of neural networks, and three new attack algorithms are introduced that succeed on both distilled and undistilled neural networks with 100% probability.
Adversarial Examples for Evaluating Reading Comprehension Systems
TLDR: This work proposes an adversarial evaluation scheme for the Stanford Question Answering Dataset that tests whether systems can answer questions about paragraphs that contain adversarially inserted sentences, which do not change the correct answer or mislead humans.
Synthetic and Natural Noise Both Break Neural Machine Translation
TLDR: It is found that a model based on a character convolutional neural network is able to simultaneously learn representations robust to multiple kinds of noise, including structure-invariant word representations and robust training on noisy texts.
Learning Robust Representations of Text
TLDR: Empirical evaluation over a range of sentiment datasets with a convolutional neural network shows that the regularization-based method achieves superior performance over noisy inputs and out-of-domain data.
...