longhorns at DADC 2022: How many linguists does it take to fool a Question Answering model? A systematic approach to adversarial attacks.

@article{Kovatchev2022longhornsAD,
  title={longhorns at DADC 2022: How many linguists does it take to fool a Question Answering model? A systematic approach to adversarial attacks.},
  author={Venelin Kovatchev and Trina Chatterjee and Venkata Subrahmanyan Govindarajan and Jifan Chen and Eunsol Choi and Gabriella Chronis and Anubrata Das and Katrin Erk and Matthew Lease and Junyi Jessy Li and Yating Wu and Kyle Mahowald},
  journal={ArXiv},
  year={2022},
  volume={abs/2206.14729}
}
Developing methods to adversarially challenge NLP systems is a promising avenue for improving both model performance and interpretability. Here, we describe the approach of the team “longhorns” on Task 1 of the First Workshop on Dynamic Adversarial Data Collection (DADC), which asked teams to manually fool a model on an Extractive Question Answering task. Our team finished first (pending validation), with a model error rate of 62%. We advocate for a systematic, linguistically informed…
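For context, a minimal sketch of how a model error rate of this kind could be tallied over manually written adversarial attempts follows. The Hugging Face pipeline, the exact-match criterion, and the (context, question, answer) data format are illustrative assumptions only, not the shared task's official validation protocol (the abstract notes that results were pending validation).

# Hypothetical sketch: tallying a model error rate over manually written
# adversarial attempts against an extractive QA model. The pipeline used,
# the exact-match criterion, and the data format are illustrative assumptions.
from transformers import pipeline

qa_model = pipeline("question-answering")  # any extractive QA model

def normalize(text: str) -> str:
    return " ".join(text.lower().split())

def model_error_rate(attempts):
    """attempts: list of (context, question, annotator_answer) triples."""
    errors = 0
    for context, question, gold_answer in attempts:
        prediction = qa_model(question=question, context=context)["answer"]
        if normalize(prediction) != normalize(gold_answer):
            errors += 1  # the model was fooled by this attempt
    return errors / len(attempts)

Under such a criterion, an error rate of 62% would mean that roughly 6 out of every 10 submitted attempts produced a model prediction that did not match the annotators' answer.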

References

Showing 1-10 of 46 references

Trick Me If You Can: Human-in-the-Loop Generation of Adversarial Examples for Question Answering

TLDR
This work proposes human-in-the-loop adversarial generation, where human authors are guided to break models through an interactive user interface, and applies this generation framework to a question answering task called Quizbowl, where trivia enthusiasts craft adversarial questions.

ANLIzing the Adversarial Natural Language Inference Dataset

We perform an in-depth error analysis of Adversarial NLI (ANLI), a recently introduced large-scale human-and-model-in-the-loop natural language inference dataset collected over multiple rounds.

SQuAD: 100,000+ Questions for Machine Comprehension of Text

TLDR
A strong logistic regression model is built, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%).
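The F1 figure here is SQuAD's token-overlap score between a predicted answer span and the gold answer. A simplified sketch of that computation (the official evaluation script additionally strips punctuation and articles and takes the maximum over multiple gold answers):

# Simplified sketch of SQuAD-style token-overlap F1. The official script
# also normalizes punctuation/articles and maxes over several gold answers.
from collections import Counter

def squad_f1(prediction: str, gold: str) -> float:
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    num_same = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(squad_f1("the 1994 World Cup", "1994 World Cup"))  # ~0.86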

Analyzing Compositionality-Sensitivity of NLI Models

TLDR
This work proposes a compositionality-sensitivity testing setup that analyzes models on natural examples from existing datasets that cannot be solved via lexical features alone, hence revealing the models' actual compositionality awareness.

Beat the AI: Investigating Adversarial Human Annotation for Reading Comprehension

TLDR
This work investigates this annotation methodology and applies it in three different settings, collecting a total of 36,000 samples with progressively stronger models in the annotation loop, finding that stronger models can still learn from datasets collected with substantially weaker models-in-the-loop.

Stress Test Evaluation for Natural Language Inference

TLDR
This work proposes an evaluation methodology consisting of automatically constructed “stress tests” that allow us to examine whether systems have the ability to make real inferential decisions, and reveals strengths and weaknesses of these models with respect to challenging linguistic phenomena.

On the Efficacy of Adversarial Data Collection for Question Answering: Results from a Large-Scale Randomized Study

TLDR
Across a variety of models and datasets, it is found that models trained on adversarial data usually perform better on other adversarial datasets but worse on a diverse collection of out-of-domain evaluation sets.

Dynabench: Rethinking Benchmarking in NLP

TLDR
It is argued that Dynabench addresses a critical need in the community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios.

Beyond Accuracy: Behavioral Testing of NLP Models with CheckList

Although measuring held-out accuracy has been the primary approach to evaluate generalization, it often overestimates the performance of NLP models, while alternative approaches for evaluating models…
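As a rough, framework-free illustration of the behavioral-testing idea (not the CheckList library's actual API), a template-driven minimum functionality test for a hypothetical sentiment classifier might look like this:

# Rough illustration of a CheckList-style minimum functionality test:
# fill a negation template with lexical variations and check that a
# (hypothetical) sentiment classifier labels every instance "negative".
def run_negation_mft(predict):  # predict: str -> "positive" | "negative"
    cases = [
        f"I {neg} {verb} the {thing}."
        for neg in ("don't", "never")
        for verb in ("like", "recommend")
        for thing in ("movie", "service")
    ]
    failures = [s for s in cases if predict(s) != "negative"]
    return len(failures) / len(cases), failures

# A naive keyword baseline fails every case, since each sentence
# contains "like" or "recommend" despite being negated.
naive = lambda s: "positive" if ("like" in s or "recommend" in s) else "negative"
failure_rate, failed = run_negation_mft(naive)
print(f"failure rate: {failure_rate:.0%}")  # 100%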

oLMpics-On What Language Model Pre-training Captures

TLDR
This work proposes eight reasoning tasks, which conceptually require operations such as comparison, conjunction, and composition, and findings can help future work on designing new datasets, models, and objective functions for pre-training.