On the Robustness of Reading Comprehension Models to Entity Renaming

@inproceedings{yan2022robustness,
  title={On the Robustness of Reading Comprehension Models to Entity Renaming},
  author={Jun Yan and Yang Xiao and Sagnik Mukherjee and Bill Yuchen Lin and Robin Jia and Xiang Ren},
  booktitle={North American Chapter of the Association for Computational Linguistics},
}
We study the robustness of machine reading comprehension (MRC) models to entity renaming—do models make more wrong predictions when the same questions are asked about an entity whose name has been changed? Such failures imply that models overly rely on entity information to answer questions, and thus may generalize poorly when facts about the world change or questions are asked about novel entities. To systematically audit this issue, we present a pipeline to automatically generate test… 
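The core perturbation can be illustrated with a minimal sketch. The function and example below are illustrative only, not the paper's actual generation pipeline; the entity names are hypothetical:

```python
import re

def rename_entity(passage: str, question: str, old_name: str, new_name: str):
    """Consistently replace every mention of an entity name in a
    passage/question pair, so the same fact is asked about a new entity."""
    pattern = re.compile(r"\b" + re.escape(old_name) + r"\b")
    return pattern.sub(new_name, passage), pattern.sub(new_name, question)

# A model that is robust to renaming should give the same answer
# before and after the substitution.
passage = "Alice Smith was born in 1965. Alice Smith studied physics."
question = "When was Alice Smith born?"
new_passage, new_question = rename_entity(passage, question,
                                          "Alice Smith", "Maria Chen")
```

Comparing model predictions on the original and renamed pairs then measures how much the model's answer depends on the surface form of the entity name.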

Honest Students from Untrusted Teachers: Learning an Interpretable Question-Answering Pipeline from a Pretrained Language Model

Markup-and-mask, a new way to build trustworthy pipeline systems from a combination of end-task annotations and frozen pretrained language models, is introduced; it combines aspects of extractive and free-text explanations.



Assessing the Benchmarking Capacity of Machine Reading Comprehension Datasets

Results suggest that most of the questions already answered correctly by the model do not necessarily require grammatical and complex reasoning, and therefore, MRC datasets will need to take extra care in their design to ensure that questions can correctly evaluate the intended skills.

Benchmarking Robustness of Machine Reading Comprehension Models

AdvRACE (Adversarial RACE) is constructed, a new model-agnostic benchmark for evaluating the robustness of MRC models under six different types of test-time perturbations, including the novel superimposed attack and distractor construction attack.

Why Machine Reading Comprehension Models Learn Shortcuts?

It is argued that a larger proportion of shortcut questions in the training data makes models rely excessively on shortcut tricks, and two new methods are proposed to quantitatively analyze the learning difficulty of shortcut and challenging questions, revealing the inherent learning mechanism behind the different performance on the two kinds of questions.

Are you tough enough? Framework for Robustness Validation of Machine Comprehension Systems

This paper proposes a framework that validates the robustness of any question answering model through model explainers, and proposes that a robust model should transgress the initial notion of semantic similarity induced by word embeddings to learn a more human-like understanding of meaning.

Adversarial Examples for Evaluating Reading Comprehension Systems

This work proposes an adversarial evaluation scheme for the Stanford Question Answering Dataset that tests whether systems can answer questions about paragraphs that contain adversarially inserted sentences without changing the correct answer or misleading humans.

RockNER: A Simple Method to Create Adversarial Examples for Evaluating the Robustness of Named Entity Recognition Models

To audit the robustness of named entity recognition (NER) models, RockNER is proposed, a simple yet effective method to create natural adversarial examples that result in a shifted distribution from the training data on which the target models have been trained.

Syntactic Data Augmentation Increases Robustness to Inference Heuristics

The best-performing augmentation method, subject/object inversion, improved BERT’s accuracy on controlled examples that diagnose sensitivity to word order from 0.28 to 0.73, suggesting that augmentation causes BERT to recruit abstract syntactic representations.

What Makes Reading Comprehension Questions Easier?

This study proposes to employ simple heuristics to split each dataset into easy and hard subsets, examines the performance of two baseline models on each subset, and observes that baseline performance on the hard subsets degrades remarkably compared to performance on the entire datasets.
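The splitting procedure can be sketched as follows. The word-overlap heuristic below is a hypothetical stand-in for the simple heuristics the study employs, and the example data is invented for illustration:

```python
def split_easy_hard(examples, heuristic):
    """Split a dataset into 'easy' (solved by the heuristic) and 'hard' subsets."""
    easy, hard = [], []
    for ex in examples:
        (easy if heuristic(ex) else hard).append(ex)
    return easy, hard

# Hypothetical heuristic: the answer appears in the passage sentence
# that has the highest word overlap with the question.
def best_sentence_overlap(ex):
    question_words = set(ex["question"].lower().split())
    best = max(ex["passage"].split(". "),
               key=lambda s: len(question_words & set(s.lower().split())))
    return ex["answer"].lower() in best.lower()

ex = {"passage": "Paris is the capital of France. Berlin is the capital of Germany.",
      "question": "What is the capital of France?",
      "answer": "Paris"}
easy, hard = split_easy_hard([ex], best_sentence_overlap)
```

Baselines are then evaluated separately on each subset; a large gap between the easy and hard subsets suggests the heuristic captures much of what the model exploits.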

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

It is shown that, in comparison to other recently introduced large-scale datasets, TriviaQA has relatively complex, compositional questions, has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and requires more cross-sentence reasoning to find answers.

UnifiedQA: Crossing Format Boundaries With a Single QA System

This work uses the latest advances in language modeling to build a single pre-trained QA model, UNIFIEDQA, that performs well across 19 QA datasets spanning 4 diverse formats, and results in a new state of the art on 10 factoid and commonsense question answering datasets.