Analyzing Dynamic Adversarial Training Data in the Limit

Eric Wallace, Adina Williams, Robin Jia, Douwe Kiela
To create models that are robust across a wide range of test inputs, training datasets should include diverse examples that span numerous phenomena. Dynamic adversarial data collection (DADC), where annotators craft examples that challenge continually improving models, holds promise as an approach for generating such diverse training sets. Prior work has shown that running DADC over 1-3 rounds can help models fix some error types, but it does not necessarily lead to better generalization beyond… 


Adversarially Constructed Evaluation Sets Are More Challenging, but May Not Be Fair

This work studies the impact of applying three common approaches for adversarial dataset creation: filtering out easy examples, perturbing examples, and model-in-the-loop data collection (ANLI and AdversarialQA), across 18 different adversary models.

Models in the Loop: Aiding Crowdworkers with Generative Annotation Assistants

Generative Annotation Assistants (GAAs) are introduced: generator-in-the-loop models that provide real-time suggestions which annotators can approve, modify, or reject entirely. GAAs are found to lead to higher downstream model performance on a variety of question answering tasks over adversarial data collection.

Overconfidence in the Face of Ambiguity with Adversarial Data

This work investigates whether models trained on adversarially-collected data are miscalibrated with respect to the ambiguity of their inputs, and finds no clear difference in accuracy between naturalistically and adversarially trained models.

Adversarial Training for High-Stakes Reliability

This work created a series of adversarial training techniques—including a tool that assists human adversaries—to find and eliminate failures in a classifier that filters text completions suggested by a generator, and found that adversarial training increased robustness to the adversarial attacks it trained on, without affecting in-distribution performance.

ANLIzing the Adversarial Natural Language Inference Dataset

We perform an in-depth error analysis of Adversarial NLI (ANLI), a recently introduced large-scale human-and-model-in-the-loop natural language inference dataset collected over multiple rounds.

Benchmarking Long-tail Generalization with Likelihood Splits

This work proposes a method to create challenging benchmarks that require generalizing to the tail of the distribution by re-splitting existing datasets into 'Likelihood splits', where examples assigned lower likelihood by a pre-trained language model are placed in the test set, and more likely examples are in the training set.
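The 'Likelihood split' recipe is simple enough to sketch. In the toy version below, `log_likelihood` is a hypothetical callable standing in for a pre-trained language model scorer, and the split fraction is an assumption for illustration:

```python
def likelihood_split(examples, log_likelihood, test_fraction=0.2):
    """Split `examples` so the least likely ones form the test set.

    `log_likelihood` maps an example to its language-model log-likelihood
    (a stand-in here; the actual scorer is an assumption of this sketch).
    """
    ranked = sorted(examples, key=log_likelihood)  # ascending: least likely first
    n_test = int(len(ranked) * test_fraction)
    test_set = ranked[:n_test]      # hardest tail goes to evaluation
    train_set = ranked[n_test:]     # the rest is kept for training
    return train_set, test_set

# Toy usage with a stand-in scorer (word count as a proxy for likelihood):
corpus = ["a b c", "a", "a b", "a b c d"]
train, test = likelihood_split(
    corpus, log_likelihood=lambda s: len(s.split()), test_fraction=0.25
)
```

The only moving part is the scoring function; swapping in real LM log-likelihoods recovers the split described above.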

Red Teaming Language Models with Language Models

This work automatically finds cases where a target LM behaves in a harmful way, by generating test cases (“red teaming”) using another LM, and evaluates the target LM’s replies to generated test questions using a classifier trained to detect offensive content.

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned

It is found that RLHF models are increasingly difficult to red team as they scale, while a flat trend with scale is found for the other model types; releasing these findings is intended to accelerate the community's ability to work together to develop shared norms, practices, and technical standards.

Dynatask: A Framework for Creating Dynamic AI Benchmark Tasks

We introduce Dynatask: an open source system for setting up custom NLP tasks that aims to greatly lower the technical knowledge and effort required for hosting and evaluating state-of-the-art NLP models.



On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study

Across a variety of models and datasets, it is found that models trained on adversarial data usually perform better on other adversarial datasets but worse on a diverse collection of out-of-domain evaluation sets.

Explaining and Harnessing Adversarial Examples

It is argued that the primary cause of neural networks' vulnerability to adversarial perturbation is their linear nature, supported by new quantitative results while giving the first explanation of the most intriguing fact about them: their generalization across architectures and training sets.

Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension

This work investigates this annotation methodology and applies it in three different settings, collecting a total of 36,000 samples with progressively stronger models in the annotation loop, finding that stronger models can still learn from datasets collected with substantially weaker models-in-the-loop.

Adversarial Filters of Dataset Biases

This work presents extensive supporting evidence that AFLite is broadly applicable for reduction of measurable dataset biases, and that models trained on the filtered datasets yield better generalization to out-of-distribution tasks.

Adversarial NLI: A New Benchmark for Natural Language Understanding

This work introduces a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure, and shows that non-expert annotators are successful at finding their weaknesses.

SWAG: A Large-Scale Adversarial Dataset for Grounded Commonsense Inference

This paper introduces the task of grounded commonsense inference, unifying natural language inference and commonsense reasoning, and proposes Adversarial Filtering (AF), a novel procedure that constructs a de-biased dataset by iteratively training an ensemble of stylistic classifiers, and using them to filter the data.
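As a rough illustration of the Adversarial Filtering loop described above, the sketch below repeatedly trains a model on a random half of the data and discards held-out examples the model solves, so that only hard examples survive. `train_and_predict` is a hypothetical stand-in for training the stylistic-classifier ensemble:

```python
import random

def adversarial_filter(examples, train_and_predict, rounds=3, seed=0):
    """Toy Adversarial Filtering: iteratively drop easy examples.

    Each example is a (features, label) pair. `train_and_predict(train,
    held_out)` is a hypothetical callable that fits a classifier on `train`
    and returns predicted labels for `held_out`.
    """
    rng = random.Random(seed)
    kept = list(examples)
    for _ in range(rounds):
        rng.shuffle(kept)
        half = len(kept) // 2
        train, held_out = kept[:half], kept[half:]
        preds = train_and_predict(train, held_out)
        # Keep the training half plus only the held-out examples the
        # classifier got wrong, i.e. the ones that remain adversarial.
        kept = train + [ex for ex, p in zip(held_out, preds) if p != ex[1]]
    return kept
```

A perfect classifier shrinks the dataset rapidly (every held-out example is filtered each round), while a classifier that never succeeds leaves it untouched; real stylistic classifiers fall in between.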

What Will it Take to Fix Benchmarking in Natural Language Understanding?

It is argued that most current benchmarks fail these criteria, and that adversarially-constructed, out-of-distribution test sets do not meaningfully address the causes of these failures.

Counterfactually-Augmented SNLI Training Data Does Not Yield Better Generalization Than Unaugmented Data

It is found that models trained on a counterfactually-augmented SNLI dataset do not generalize better than models trained on unaugmented data of similar size, and that counterfactual augmentation can hurt performance, yielding models that are less robust to challenge examples.

Semantically Equivalent Adversarial Rules for Debugging NLP models

This work presents semantically equivalent adversaries (SEAs) – semantics-preserving perturbations that induce changes in the model’s predictions – and generalizes them into rules that produce adversaries across many semantically similar instances.

New Protocols and Negative Results for Textual Entailment Data Collection

Four alternative protocols are proposed, each aimed at improving either the ease with which annotators can produce sound training examples or the quality and diversity of those examples, and it is observed that all four new protocols reduce previously observed issues with annotation artifacts.