Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment

@inproceedings{Jin2019IsBR,
  title={Is BERT Really Robust? A Strong Baseline for Natural Language Attack on Text Classification and Entailment},
  author={Di Jin and Zhijing Jin and Joey Tianyi Zhou and Peter Szolovits},
  booktitle={AAAI Conference on Artificial Intelligence},
  year={2020}
}
Machine learning algorithms are often vulnerable to adversarial examples that have imperceptible alterations from the original counterparts but can fool the state-of-the-art models. It is helpful to evaluate or even improve the robustness of these models by exposing the maliciously crafted adversarial examples. In this paper, we present TextFooler, a simple but strong baseline to generate adversarial text. By applying it to two fundamental natural language tasks, text classification and textual entailment, …
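
As a rough illustration of such an attack loop, the following is a minimal black-box word-substitution sketch in the spirit of TextFooler, not the authors' implementation; model_predict, get_synonyms, and semantic_sim are hypothetical stand-ins for the target classifier (returning class probabilities), a synonym source, and a sentence-similarity check.

def word_importance(words, label, model_predict):
    """Score each word by how much deleting it lowers the true-class probability."""
    base = model_predict(" ".join(words))[label]
    scores = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]
        scores.append(base - model_predict(" ".join(reduced))[label])
    return scores

def attack(text, label, model_predict, get_synonyms, semantic_sim, sim_threshold=0.8):
    """Greedily replace the most important words with synonyms until the label flips."""
    words = text.split()
    scores = word_importance(words, label, model_predict)
    for i in sorted(range(len(words)), key=lambda j: scores[j], reverse=True):
        best_word = None
        best_prob = model_predict(" ".join(words))[label]
        for cand in get_synonyms(words[i]):
            trial = words[:i] + [cand] + words[i + 1:]
            if semantic_sim(text, " ".join(trial)) < sim_threshold:
                continue  # reject substitutions that drift too far from the original
            prob = model_predict(" ".join(trial))[label]
            if prob < best_prob:
                best_word, best_prob = cand, prob
        if best_word is not None:
            words[i] = best_word
            probs = model_predict(" ".join(words))
            if max(range(len(probs)), key=lambda c: probs[c]) != label:
                return " ".join(words)  # prediction flipped: attack succeeded
    return None  # no adversarial example found under the similarity constraint

The importance ranking and the similarity filter correspond to the two properties the abstract emphasizes: effectiveness of the attack and preservation of the original meaning.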

Robust Neural Text Classification and Entailment via Mixup Regularized Adversarial Training

This work proposes mixup regularized adversarial training (MRAT) against multi-level attacks, which can use multiple adversarial examples to increase the model's intrinsic robustness without sacrificing performance on normal data.
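
As a generic illustration of the mixup ingredient (not the paper's exact MRAT objective), the sketch below interpolates the embedded representations of a clean example and an adversarial one together with their label distributions; all names and shapes are assumptions.

import numpy as np

def mixup_pair(clean_emb, adv_emb, clean_label, adv_label, num_classes, alpha=0.4):
    """Mix an embedded clean example with an adversarial one (standard mixup).

    clean_emb, adv_emb: (seq_len, dim) embedding matrices of the two inputs.
    Returns the interpolated embeddings and a soft label distribution.
    """
    lam = np.random.beta(alpha, alpha)
    mixed_emb = lam * clean_emb + (1.0 - lam) * adv_emb
    y_clean = np.eye(num_classes)[clean_label]
    y_adv = np.eye(num_classes)[adv_label]
    mixed_label = lam * y_clean + (1.0 - lam) * y_adv
    return mixed_emb, mixed_label

Training on such interpolated pairs alongside the original data is the general mechanism by which mixup-style regularization aims to add robustness without giving up clean accuracy.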

Detecting Textual Adversarial Examples Based on Distributional Characteristics of Data Representations

Two new reactive methods for NLP are proposed to fill the gap of effective, general reactive approaches to defence via detection of textual adversarial examples, a gap that has already been addressed in the image processing literature.

TextDefense: Adversarial Text Detection based on Word Importance Entropy

This paper exhaustively investigates the adversarial attack algorithms in NLP and proposes TextDefense, a new adversarial example detection framework that utilizes the target model's capability to defend against adversarial attacks while requiring no prior knowledge.

Contrasting Human- and Machine-Generated Word-Level Adversarial Examples for Text Classification

This work reports on crowdsourcing studies in which humans are tasked with iteratively modifying words in an input text, while receiving immediate model feedback, with the aim of causing a sentiment classification model to misclassify the example.

Detection of Adversarial Examples in Text Classification: Benchmark and Baseline via Robust Density Estimation

A benchmark covering four popular attack methods on four datasets and four models is proposed, together with a competitive baseline based on density estimation that achieves the highest AUC on 29 out of 30 dataset-attack-model combinations.
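
To make the density-estimation idea concrete, here is a minimal sketch (not the paper's estimator) that fits a single Gaussian to hidden representations of clean texts and flags inputs that are unusually unlikely under it; the feature-extraction step and the threshold are assumptions.

import numpy as np

def fit_clean_density(clean_features):
    """Fit a multivariate Gaussian to (n, d) features of clean training texts."""
    mu = clean_features.mean(axis=0)
    cov = np.cov(clean_features, rowvar=False)
    cov += 1e-6 * np.eye(cov.shape[0])   # regularize for invertibility
    return mu, np.linalg.inv(cov)

def looks_adversarial(features, mu, cov_inv, threshold):
    """Flag a (d,) feature vector whose Mahalanobis distance from clean data is large."""
    diff = features - mu
    return float(diff @ cov_inv @ diff) > threshold

The threshold would typically be chosen as a high percentile of the distances observed on held-out clean data.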

Textual Adversarial Attacking with Limited Queries

A novel attack method is proposed whose main idea is to fully utilize the adversarial examples generated by a local model and to transfer part of the attack to that local model ahead of time, thereby reducing the cost of attacking the target model.

A Differentiable Language Model Adversarial Attack on Text Classifiers

A new black-box sentence-level attack is proposed that fine-tunes a pre-trained language model to generate adversarial examples; it outperforms competitors on a diverse set of NLP problems in both computed metrics and human evaluation.

T3: Tree-Autoencoder Constrained Adversarial Text Generation for Targeted Attack

T3-generated adversarial texts can successfully manipulate NLP models into outputting the targeted incorrect answer without misleading humans, and they have high transferability, which enables black-box attacks in practice.

Rethinking Textual Adversarial Defense for Pre-Trained Language Models

A universal defense framework is designed that is among the first to perform textual adversarial defense without knowing the specific attack; it achieves comparable or even higher after-attack accuracy than attack-specific defenses while preserving higher original accuracy.
...

TextBugger: Generating Adversarial Text Against Real-world Applications

This paper presents TextBugger, a general attack framework for generating adversarial texts, and empirically evaluates its effectiveness, evasiveness, and efficiency on a set of real-world DLTU systems and services used for sentiment analysis and toxic content detection.

Generating Natural Language Adversarial Examples

A black-box population-based optimization algorithm is used to generate semantically and syntactically similar adversarial examples that fool well-trained sentiment analysis and textual entailment models with success rates of 97% and 70%, respectively.

Black-Box Generation of Adversarial Text Sequences to Evade Deep Learning Classifiers

A novel algorithm is presented, DeepWordBug, to effectively generate small text perturbations in a black-box setting that forces a deep-learning classifier to misclassify a text input.

Deep Text Classification Can be Fooled

An effective method is presented to craft text adversarial samples that can successfully fool both state-of-the-art character-level and word-level DNN-based text classifiers while remaining difficult to perceive.

Generating Natural Adversarial Examples

This paper proposes a framework to generate natural and legible adversarial examples that lie on the data manifold, by searching in semantic space of dense and continuous data representation, utilizing the recent advances in generative adversarial networks.

HotFlip: White-Box Adversarial Examples for Text Classification

An efficient method is proposed to generate white-box adversarial examples that trick a character-level neural classifier, based on an atomic flip operation that swaps one token for another using the gradients of the one-hot input vectors.
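
The flip operation amounts to a first-order estimate of how the loss changes when one one-hot token is swapped for another. The sketch below shows only that scoring step and assumes the gradient of the loss with respect to the one-hot input has already been obtained by backpropagation; the names are illustrative.

import numpy as np

def best_flip(onehot_grad, token_ids):
    """Pick the single token flip with the largest estimated increase in loss.

    onehot_grad: (seq_len, vocab_size) gradient of the loss w.r.t. the one-hot input.
    token_ids:   (seq_len,) indices of the current tokens.
    Returns (position, replacement_token_id, estimated_loss_gain).
    """
    seq_len = onehot_grad.shape[0]
    current = onehot_grad[np.arange(seq_len), token_ids]   # gradient at the current tokens
    gains = onehot_grad - current[:, None]                 # grad[i, b] - grad[i, a]
    gains[np.arange(seq_len), token_ids] = -np.inf         # forbid no-op "flips"
    i, b = np.unravel_index(np.argmax(gains), gains.shape)
    return int(i), int(b), float(gains[i, b])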

Semantically Equivalent Adversarial Rules for Debugging NLP models

This work presents semantically equivalent adversaries (SEAs), semantic-preserving perturbations that induce changes in the model's predictions, and generalizes them into semantically equivalent adversarial rules (SEARs), simple replacement rules that induce adversaries on many semantically similar instances.

Towards Deep Learning Models Resistant to Adversarial Attacks

This work studies the adversarial robustness of neural networks through the lens of robust optimization, and suggests the notion of security against a first-order adversary as a natural and broad security guarantee.
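
The robust-optimization view in that work trains against the strongest first-order adversary, usually instantiated as projected gradient descent (PGD) inside an L-infinity ball. Below is a minimal PGD sketch on a toy logistic-regression loss, chosen so the input gradient can be written in closed form; it illustrates the inner maximization only, not the paper's full training setup.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd_attack(x, y, w, b, eps=0.1, step=0.02, iters=20):
    """PGD on a logistic-regression loss, constrained to an L-infinity ball of radius eps.

    x: (d,) input, y: label in {0, 1}, (w, b): fixed model parameters.
    """
    x_adv = x + np.random.uniform(-eps, eps, size=x.shape)   # random start inside the ball
    for _ in range(iters):
        p = sigmoid(w @ x_adv + b)
        grad = (p - y) * w               # gradient of the cross-entropy loss w.r.t. the input
        x_adv = x_adv + step * np.sign(grad)                  # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)              # project back into the ball
    return x_adv

Adversarial training in this formulation then minimizes the loss on these worst-case inputs rather than on the clean ones.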

Adversarial Machine Learning at Scale

This research applies adversarial training to ImageNet, finds that single-step attacks are best for mounting black-box attacks, and resolves a "label leaking" effect that causes adversarially trained models to perform better on adversarial examples than on clean examples.