CAT-Gen: Improving Robustness in NLP Models via Controlled Adversarial Text Generation

Authors: Tianlu Wang, Xuezhi Wang, Yao Qin, Ben Packer, Kang Li, Jilin Chen, Alex Beutel, Ed H. Chi
Published in: Conference on Empirical Methods in Natural Language Processing

NLP models are shown to suffer from robustness issues, i.e., a model's prediction can be easily changed under small perturbations to the input. In this work, we present a Controlled Adversarial Text Generation (CAT-Gen) model that, given an input text, generates adversarial texts through controllable attributes that are known to be invariant to task labels. For example, in order to attack a model for sentiment classification over product reviews, we can use the product categories as the… 

KATG: Keyword-Bias-Aware Adversarial Text Generation for Text Classification

This paper proposes a keyword-bias-aware adversarial text generation model that implicitly generates adversarial sentences using a generator-discriminator structure and can strengthen the victim model's robustness and generalization.

Adversarial Training for Improving Model Robustness? Look at Both Prediction and Interpretation

FLAT explicitly targets the vulnerability caused by the mismatch between a model's understanding of replaced words and of their synonyms in original/adversarial example pairs, by regularizing the corresponding global word importance scores.

ValCAT: Variable-Length Contextualized Adversarial Transformations Using Encoder-Decoder Language Model

ValCAT is a black-box attack framework that misleads the language model by applying variable-length contextualized transformations to the original text. It expands the basic unit of perturbation from single words to spans of multiple consecutive words, enhancing the perturbation capability.

Quantifying the Performance of Adversarial Training on Language Models with Distribution Shifts

This paper examines the limitations of adversarial training due to the temporal changes of machine learning models using a natural language task and shows that certain adversarially trained models are even more prone to drift than others.

Better Robustness by More Coverage: Adversarial Training with Mixup Augmentation for Robust Fine-tuning

Adversarial Data Augmentation with Mixup linearly interpolates the representations of pairs of training examples to form new virtual samples, which are more abundant and diverse than the discrete adversarial examples used in conventional ADA.
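The interpolation step described above can be illustrated with a minimal sketch. It assumes representations and labels are plain lists of floats and that the labels are interpolated with the same coefficient as the representations, as in standard mixup; the function name `mixup_pair` and the coefficient name `lam` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple


def mixup_pair(
    x_a: List[float],
    y_a: List[float],
    x_b: List[float],
    y_b: List[float],
    lam: float,
) -> Tuple[List[float], List[float]]:
    """Linearly interpolate two (representation, label) pairs.

    lam is the weight on the first example and must lie in [0, 1].
    Interpolating the labels as well is an assumption of this sketch.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    if len(x_a) != len(x_b) or len(y_a) != len(y_b):
        raise ValueError("paired vectors must have the same length")
    # Virtual sample: a convex combination of the two representations.
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    # Soft label: the same convex combination of the two label vectors.
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix


if __name__ == "__main__":
    x, y = mixup_pair([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], lam=0.25)
    print(x, y)
```

In an adversarial data augmentation setting, one member of the pair could be an original example and the other its adversarial counterpart, though how pairs are chosen is left open here.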

Multi-granularity Textual Adversarial Attack with Behavior Cloning

This paper proposes MAYA, a Multi-grAnularitY Attack model that generates high-quality adversarial samples with fewer queries to victim models. It also proposes a reinforcement-learning-based method that trains a multi-granularity attack agent through behavior cloning, using expert knowledge from the MAYA algorithm, to further reduce the number of queries.

Text Adversarial Attacks and Defenses: Issues, Taxonomy, and Perspectives

This work introduces the pipeline of NLP, including the vector representations of text, DNN-based victim models, and a formal definition of adversarial attacks, which makes the review self-contained.

TREATED: Towards Universal Defense against Textual Adversarial Attacks

TREATED is a universal adversarial detection method that can defend against attacks of various perturbation levels without making any assumptions, and it achieves better detection performance than baselines.

Making Adversarially-Trained Language Models Forget with Model Retraining: A Case Study on Hate Speech Detection

The findings indicate that adversarial training is highly task-dependent as well as dataset-dependent: models trained on the same dataset achieve high prediction accuracy but fare poorly when tested on a new dataset, even after retraining with adversarial examples.

Phrase-level Textual Adversarial Attack with Label Preservation

This paper proposes PLAT, which generates adversarial samples through phrase-level perturbations, and develops a label-preservation technique tuned on each class to rule out perturbations that could alter the original class label for humans.



FreeLB: Enhanced Adversarial Training for Language Understanding

A novel adversarial training algorithm, FreeLB, is proposed that promotes higher robustness and invariance in the embedding space by adding adversarial perturbations to word embeddings and minimizing the resultant adversarial risk inside different regions around input samples.

Certified Robustness to Adversarial Word Substitutions

This paper trains the first models that are provably robust to all word substitutions in this exponentially large family of label-preserving transformations, in which every word in the input can be replaced with a similar word.
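To make the size of this family concrete, the sketch below enumerates every sentence reachable by per-word substitution. It only illustrates the perturbation family, not the paper's certification method; the `similar` dictionary and both function names are hypothetical.

```python
from itertools import product
from typing import Dict, Iterator, List


def substitution_family(
    tokens: List[str], similar: Dict[str, List[str]]
) -> Iterator[List[str]]:
    """Yield every sentence obtained by keeping or replacing each word."""
    # Each position may keep its original word or take any listed similar word.
    choices = [[tok] + similar.get(tok, []) for tok in tokens]
    for combo in product(*choices):
        yield list(combo)


def family_size(tokens: List[str], similar: Dict[str, List[str]]) -> int:
    """Count the family without enumerating it: a product over positions."""
    size = 1
    for tok in tokens:
        size *= 1 + len(similar.get(tok, []))
    return size


if __name__ == "__main__":
    sentence = ["good", "movie"]
    similar_words = {"good": ["great", "fine"], "movie": ["film"]}
    for variant in substitution_family(sentence, similar_words):
        print(" ".join(variant))
    print(family_size(sentence, similar_words))
```

Because the count is a product over positions, it grows exponentially with sentence length, which is why exhaustive enumeration is impractical for realistic inputs.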

Generating Natural Language Adversarial Examples

A black-box population-based optimization algorithm is used to generate semantically and syntactically similar adversarial examples that fool well-trained sentiment analysis and textual entailment models with success rates of 97% and 70%, respectively.

Generating Natural Adversarial Examples

This paper proposes a framework to generate natural and legible adversarial examples that lie on the data manifold by searching in the semantic space of a dense and continuous data representation, utilizing recent advances in generative adversarial networks.

Is BERT Really Robust? Natural Language Attack on Text Classification and Entailment

TextFooler, a general attack framework for generating natural adversarial texts, is presented and shown to outperform state-of-the-art attacks in terms of success rate and perturbation rate.

Explaining and Harnessing Adversarial Examples

It is argued that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature, supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets.

Adversarial Domain Adaptation for Machine Reading Comprehension

An Adversarial Domain Adaptation framework for Machine Reading Comprehension (MRC) is proposed, in which pseudo questions are first generated for unlabeled passages in the target domain, and a domain classifier is then incorporated into an MRC model to predict which domain a given passage-question pair comes from.

Adversarial Example Generation with Syntactically Controlled Paraphrase Networks

A combination of automated and human evaluations shows that SCPNs generate paraphrases that follow their target specifications without decreasing paraphrase quality when compared to baseline (uncontrolled) paraphrase systems.

Plug and Play Language Models: A Simple Approach to Controlled Text Generation

The Plug and Play Language Model (PPLM) for controllable language generation is proposed, which combines a pretrained LM with one or more simple attribute classifiers that guide text generation without any further training of the LM.

HotFlip: White-Box Adversarial Examples for Text Classification

An efficient method is proposed for generating white-box adversarial examples that trick a character-level neural classifier. It relies on an atomic flip operation, which swaps one token for another based on the gradients of the one-hot input vectors.
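A gradient-based flip of this kind can be sketched as follows. The sketch assumes the loss gradient with respect to the one-hot inputs is already available as a positions-by-vocabulary matrix, and it scores each candidate flip with a first-order estimate of the change in loss. The function name `best_flip` and the toy numbers are hypothetical, and details of the actual method (such as its search procedure) are not reproduced.

```python
from typing import List, Optional, Tuple


def best_flip(
    grad: List[List[float]], tokens: List[int]
) -> Optional[Tuple[int, int, float]]:
    """Pick the single token swap with the largest estimated loss increase.

    grad[i][v] is the loss gradient for vocabulary entry v at position i.
    tokens[i] is the index of the token currently at position i.
    Returns (position, new_token, estimated_gain), or None if no swap exists.
    """
    best: Optional[Tuple[int, int, float]] = None
    for i, current in enumerate(tokens):
        for candidate, g in enumerate(grad[i]):
            if candidate == current:
                continue
            # First-order estimate: flipping the one-hot bit from `current`
            # to `candidate` changes the loss by about g - grad[i][current].
            gain = g - grad[i][current]
            if best is None or gain > best[2]:
                best = (i, candidate, gain)
    return best


if __name__ == "__main__":
    toy_grad = [[0.1, 0.5, -0.2], [0.3, 0.0, 0.9]]
    print(best_flip(toy_grad, tokens=[0, 1]))
```

Scoring all flips from a single gradient computation is what makes this kind of attack cheap compared with querying the model once per candidate swap.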