Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning

@inproceedings{Dasigi2019QuorefAR,
  title={Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning},
  author={Pradeep Dasigi and Nelson F. Liu and Ana Marasovi{\'c} and Noah A. Smith and Matt Gardner},
  booktitle={EMNLP},
  year={2019}
}
Machine comprehension of texts longer than a single sentence often requires coreference resolution. […] We deal with this issue by using a strong baseline model as an adversary in the crowdsourcing loop, which helps crowdworkers avoid writing questions with exploitable surface cues. We show that state-of-the-art reading comprehension models perform significantly worse than humans on this benchmark: the best model performance is 70.5 F1, while the estimated human performance is 93.4 F1.
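
The adversary-in-the-loop idea above can be sketched in a few lines: a candidate question is accepted only if the baseline reader fails to answer it, so questions answerable from surface cues are sent back for revision. This is a minimal illustration, not the authors' actual pipeline; `baseline_answer`, the token-level F1 check, and the acceptance threshold are hypothetical placeholders for whatever QA model and criterion are used.

```python
from collections import Counter


def f1_overlap(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted answer span and the gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def accept_question(passage: str, question: str, gold_answer: str,
                    baseline_answer, max_f1: float = 0.5) -> bool:
    """Accept a crowdworker's question only if the adversary fails on it.

    `baseline_answer(passage, question)` is assumed to wrap any pretrained
    extractive QA model. If that model already answers the question well,
    the question is likely exploitable via surface cues and is rejected.
    """
    prediction = baseline_answer(passage, question)
    return f1_overlap(prediction, gold_answer) < max_f1
```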

Citations

Coreference Reasoning in Machine Reading Comprehension
TLDR
A methodology for creating MRC datasets that better reflect the challenges of coreference reasoning is proposed, together with an effective way to use naturally occurring coreference phenomena from existing coreference resolution datasets when training MRC models.
On Making Reading Comprehension More Comprehensive
TLDR
This work justifies a question answering approach to reading comprehension and describes the various kinds of questions one might use to more fully test a system’s comprehension of a passage, moving beyond questions that only probe local predicate-argument structures.
IIRC: A Dataset of Incomplete Information Reading Comprehension Questions
TLDR
A dataset of more than 13K questions over paragraphs from English Wikipedia that provide only partial information to answer them, with the missing information occurring in one or more linked documents; a baseline model achieves 31.1% F1 on this task, while estimated human performance is 88.4%.
Tracing Origins: Coreference-aware Machine Reading Comprehension
TLDR
This paper imitates the human reading process in connecting anaphoric expressions and explicitly leverages coreference information about entities to enhance the word embeddings from the pre-trained language model, in order to highlight the coreference mentions of the entity that must be identified for coreference-intensive question answering in QUOREF.
Evaluation of Single-Span Models on Extractive Multi-Span Question-Answering
TLDR
This work introduces a newly compiled dataset consisting of questions with multiple answers that originate from previously existing datasets, and runs BERT-based models pre-trained for question-answering on the authors' constructed dataset to evaluate their reading comprehension abilities.
Coreference Resolution as Query-based Span Prediction
TLDR
An accurate and extensible approach that formulates coreference resolution as a span prediction task, as in machine reading comprehension (MRC), providing the flexibility to retrieve mentions left out at the mention proposal stage.
TORQUE: A Reading Comprehension Dataset of Temporal Ordering Questions
TLDR
TORQUE is introduced, a new English reading comprehension benchmark built on 3.2k news snippets with 21k human-generated questions querying temporal relationships, and results show that RoBERTa-large achieves an exact-match score of 51% on the test set of TORQUE, about 30% behind human performance.
MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics
TLDR
LERC, a Learned Evaluation metric for Reading Comprehension, is trained to mimic human judgement scores; it achieves 80% accuracy and outperforms baselines by 14 to 26 absolute percentage points while leaving significant room for improvement.
Why Machine Reading Comprehension Models Learn Shortcuts?
TLDR
It is argued that a larger proportion of shortcut questions in the training data makes models rely excessively on shortcut tricks, and two new methods are proposed to quantitatively analyze the learning difficulty of shortcut versus challenging questions, revealing the inherent learning mechanism behind the different performance on the two kinds of questions.

References

Showing 1-10 of 32 references
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
TLDR
A new reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs, and presents a new model that combines reading comprehension methods with simple numerical reasoning to achieve 51% F1.
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
TLDR
It is shown that, in comparison to other recently introduced large-scale datasets, TriviaQA has relatively complex, compositional questions, has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and requires more cross-sentence reasoning to find answers.
MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text
TLDR
MCTest is presented, a freely available set of stories and associated questions intended for research on the machine comprehension of text that requires machines to answer multiple-choice reading comprehension questions about fictional stories, directly tackling the high-level goal of open-domain machine comprehension.
How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks
TLDR
Sensible baselines are established for the bAbI, SQuAD, CBT, CNN, and Who-did-What datasets, finding that question- and passage-only models often perform surprisingly well.
SQuAD: 100,000+ Questions for Machine Comprehension of Text
TLDR
A strong logistic regression model is built, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%).
PreCo: A Large-scale Dataset in Preschool Vocabulary for Coreference Resolution
TLDR
Experiments show that, with higher training-test overlap, error analysis on PreCo is more efficient than on OntoNotes, a popular existing dataset; singleton mentions are also annotated, making it possible for the first time to quantify the influence that a mention detector has on coreference resolution performance.
Removing the Training Wheels: A Coreference Dataset that Entertains Humans and Challenges Computers
TLDR
This work uses the quiz bowl community to develop a new coreference dataset, together with an annotation framework that can tag any text data with coreferences and named entities, and successfully integrates active learning into this annotation pipeline to collect documents maximally useful to coreference models.
Simple and Effective Multi-Paragraph Reading Comprehension
TLDR
It is shown that performance can be significantly improved by using a modified training scheme that teaches the model to ignore non-answer-containing paragraphs, which involves sampling multiple paragraphs from each document and using an objective function that requires the model to produce globally correct output.
A Crowdsourced Corpus of Multiple Judgments and Disagreement on Anaphoric Interpretation
TLDR
The corpus, containing annotations for about 108,000 markables, is one of the largest coreference corpora for English and one of the largest crowdsourced NLP corpora, but its main feature is the large number of judgments per markable, which makes it a unique resource for the study of disagreements on anaphoric interpretation.
Annotation Artifacts in Natural Language Inference Data
TLDR
It is shown that a simple text categorization model can correctly classify the hypothesis alone in about 67% of SNLI and 53% of MultiNLI, and that specific linguistic phenomena such as negation and vagueness are highly correlated with certain inference classes.