DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

@inproceedings{Dua2019DROPAR,
  title={DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs},
  author={Dheeru Dua and Yizhong Wang and Pradeep Dasigi and Gabriel Stanovsky and Sameer Singh and Matt Gardner},
  booktitle={NAACL},
  year={2019}
}
Reading comprehension has recently seen rapid progress, with systems matching humans on the most popular datasets for the task. However, a large body of work has highlighted the brittleness of these systems, showing that there is much work left to be done. We introduce a new reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs. In this crowdsourced, adversarially-created, 96k-question benchmark, a system must resolve references in a question…
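
The benchmark's name summarizes what it asks for: instead of selecting a single span, a system often has to locate several numbers or entities in the passage and apply a discrete operation such as counting, addition/subtraction, or sorting to them. A minimal sketch of that kind of reasoning over a hypothetical DROP-style passage (the passage and the hard-coded operations below are illustrative only, not from the dataset or the paper's baselines):

```python
import re

# Hypothetical passage in the style of DROP's NFL game summaries (not from the dataset).
passage = (
    "The Packers took an early lead on a 25-yard field goal. "
    "The Bears answered with a 12-yard touchdown pass, and the Packers "
    "closed the half with a 48-yard field goal."
)

# Pull every number out of the text so discrete operations can be applied to it.
numbers = [int(n) for n in re.findall(r"\d+", passage)]

# "How many field goals were kicked?" -> counting event mentions.
num_field_goals = passage.count("field goal")

# "How many yards longer was the longest field goal than the shortest?" -> subtraction
# over the two field-goal distances located above.
field_goal_yards = [n for n in numbers if n in (25, 48)]
yard_difference = max(field_goal_yards) - min(field_goal_yards)

print(numbers)          # [25, 12, 48]
print(num_field_goals)  # 2
print(yard_difference)  # 23
```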

Citations of this paper

On Making Reading Comprehension More Comprehensive
TLDR
This work justifies a question answering approach to reading comprehension and describes the various kinds of questions one might use to more fully test a system’s comprehension of a passage, moving beyond questions that only probe local predicate-argument structures.
BiQuAD: Towards QA based on deeper text understanding
TLDR
This work introduces a new dataset called BiQuAD that requires deeper comprehension in order to answer questions in both an extractive and a deductive fashion, and shows that state-of-the-art QA models do not perform well on the challenging long-form contexts and reasoning requirements posed by the dataset.
Comprehensive Multi-Dataset Evaluation of Reading Comprehension
TLDR
An evaluation server, ORB, is presented that reports performance on seven diverse reading comprehension datasets, encouraging and facilitating the testing of a single model's capability in understanding a wide variety of reading phenomena.
ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning
TLDR
A new Reading Comprehension dataset requiring logical reasoning (ReClor), extracted from standardized graduate admission examinations, is introduced, and it is proposed to identify biased data points and separate them into an EASY set, with the rest forming a HARD set, in order to comprehensively evaluate the logical reasoning ability of models on ReClor.
MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics
TLDR
A Learned Evaluation metric for Reading Comprehension, LERC, is trained to mimic human judgement scores, which achieves 80% accuracy and outperforms baselines by 14 to 26 absolute percentage points while leaving significant room for improvement.
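
LERC is a learned metric: a regression model trained on human judgement scores of candidate answers given the passage, question, and reference. A minimal sketch of that setup using a generic Hugging Face regression head (the checkpoint, input formatting, and separators below are assumptions for illustration, not the authors' released model or data):

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# A generic BERT encoder with a single-output regression head; LERC is built in this
# spirit but has its own released checkpoint and training data (not used here).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=1)

passage = "The bridge opened in 1937 after four years of construction."
question = "How long did the construction take?"
reference, candidate = "four years", "about 4 years"

# Concatenating passage, question, and reference against the candidate is an
# assumption for illustration; the real metric defines its own input format.
inputs = tokenizer(f"{passage} {question} {reference}", candidate,
                   return_tensors="pt", truncation=True)
with torch.no_grad():
    score = model(**inputs).logits.squeeze().item()

# Without fine-tuning on human judgement scores this number is meaningless;
# training would regress it onto annotator ratings of candidate correctness.
print(score)
```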
Getting Closer to AI Complete Question Answering: A Set of Prerequisite Real Tasks
TLDR
QuAIL is presented, the first RC dataset to combine text-based, world-knowledge, and unanswerable questions, and to provide question-type annotation that enables diagnostics of the reasoning strategies used by a given QA system.
Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning
TLDR
This work presents a new crowdsourced dataset containing more than 24K span-selection questions that require resolving coreference among entities in over 4.7K English paragraphs from Wikipedia, and shows that state-of-the-art reading comprehension models perform significantly worse than humans on this benchmark.
A Simple and Effective Model for Answering Multi-span Questions
TLDR
This work suggests a new approach for tackling multi-span questions, based on sequence tagging, which differs from previous approaches to answering span questions, and shows that this approach leads to an absolute improvement on multi-span questions and slightly eclipses the current state-of-the-art results on the entire DROP dataset.
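
The sequence-tagging formulation marks each passage token as part of an answer span or not, so several disjoint spans can be read off at once. A toy sketch of decoding BIO tags into spans (the tags are hand-specified here; in the cited model they come from a learned tagger over contextual token representations):

```python
def decode_bio(tokens, tags):
    """Collect contiguous B/I runs of tags into answer spans."""
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":                  # a new span starts here
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:    # continue the currently open span
            current.append(token)
        else:                           # "O" (or a stray "I") closes any open span
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

tokens = ["Touchdowns", "were", "thrown", "by", "Tom", "Brady", "and", "Peyton", "Manning"]
tags   = ["O",          "O",    "O",      "O",  "B",   "I",     "O",   "B",      "I"]
print(decode_bio(tokens, tags))  # ['Tom Brady', 'Peyton Manning'] -> a two-span answer
```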
A Multi-Type Multi-Span Network for Reading Comprehension that Requires Discrete Reasoning
TLDR
The Multi-Type Multi-Span Network (MTMSN) is introduced, a neural reading comprehension model that combines a multi-type answer predictor designed to support various answer types with a multi-span extraction method for dynamically producing one or multiple text spans.
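
The key idea is a type switch: the model first predicts what kind of answer the question needs (span(s), count, arithmetic over passage numbers, ...) and then routes to the corresponding head. A schematic of that control flow with stand-in heuristic heads (everything below is illustrative; it is not MTMSN's architecture or code):

```python
import re

def predict_answer_type(question: str) -> str:
    # Stand-in for MTMSN's learned answer-type classifier (keyword heuristic here).
    q = question.lower()
    if q.startswith("how many") and "yards" not in q:
        return "count"
    if "difference" in q or "how many more" in q:
        return "arithmetic"
    return "span"

def answer(question: str, passage: str):
    numbers = [int(n) for n in re.findall(r"\d+", passage)]
    kind = predict_answer_type(question)
    if kind == "count":
        # Counting head: a naive mention count stands in for the learned head.
        return kind, passage.count("touchdown")
    if kind == "arithmetic":
        # Arithmetic head: combine passage numbers with signed operations.
        return kind, max(numbers) - min(numbers)
    # Span head: extract one or more text spans (placeholder extraction).
    return kind, [passage.split()[0]]

passage = "Smith scored a 3-yard touchdown and later added a 45-yard touchdown."
print(answer("How many touchdowns did Smith score?", passage))          # ('count', 2)
print(answer("What was the difference in touchdown length?", passage))  # ('arithmetic', 42)
```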
IIRC: A Dataset of Incomplete Information Reading Comprehension Questions
TLDR
A dataset with more than 13K questions over paragraphs from English Wikipedia that provide only partial information to answer them, with the missing information occurring in one or more linked documents; a baseline model achieves 31.1% F1 on this task, while estimated human performance is 88.4%.
...

References

SHOWING 1-10 OF 57 REFERENCES
Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences
TLDR
The dataset is the first to study multi-sentence inference at scale, with an open-ended set of question types that require reasoning skills; human solvers achieve an F1-score of 88.1%.
How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks
TLDR
Sensible baselines are established for the bAbI, SQuAD, CBT, CNN, and Who-did-What datasets, and question- and passage-only models are often found to perform surprisingly well.
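
The point of such baselines is diagnostic: if a model that never sees the passage (or never sees the question) already scores well, the benchmark has exploitable artifacts. A toy illustration of a question-only, majority-answer check (the data below is made up; the cited paper uses stronger neural baselines):

```python
from collections import Counter

# Toy QA pairs: if one answer dominates, a model that never reads the passage
# can still look competent, which is exactly the artifact such baselines expose.
train = [
    ("What color is the car?",   "red"),
    ("What color is the house?", "red"),
    ("What color is the boat?",  "blue"),
]
test = [
    ("What color is the bike?",  "red"),
    ("What color is the door?",  "red"),
]

# Question-only baseline in its crudest form: predict the most frequent training answer.
majority_answer, _ = Counter(answer for _, answer in train).most_common(1)[0]
accuracy = sum(majority_answer == gold for _, gold in test) / len(test)
print(majority_answer, accuracy)  # red 1.0 -> high accuracy without reading anything
```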
The NarrativeQA Reading Comprehension Challenge
TLDR
A new dataset and set of tasks in which the reader must answer questions about stories by reading entire books or movie scripts are presented, designed so that successfully answering their questions requires understanding the underlying narrative rather than relying on shallow pattern matching or salience.
A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task
TLDR
A thorough examination of the reading comprehension task built by pairing CNN and Daily Mail news articles with their summarized bullet points to create over a million training examples, showing that a neural network can be trained to give good performance on this task.
Constructing Datasets for Multi-hop Reading Comprehension Across Documents
TLDR
A novel task to encourage the development of models for text understanding across multiple documents and to investigate the limits of existing methods, in which a model learns to seek and combine evidence, effectively performing multi-hop (i.e., multi-step) inference.
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
TLDR
It is shown that, in comparison to other recently introduced large-scale datasets, TriviaQA has relatively complex, compositional questions, has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and requires more cross-sentence reasoning to find answers.
SQuAD: 100,000+ Questions for Machine Comprehension of Text
TLDR
A strong logistic regression model is built, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%).
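
The F1 numbers quoted for SQuAD-style evaluation are token-level: the prediction and gold answer are compared as bags of tokens. A sketch of that metric (whitespace tokenization only; the official SQuAD script additionally lowercases and strips punctuation and articles):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Bag-of-tokens F1 between a predicted and a gold answer string."""
    pred_tokens, gold_tokens = prediction.split(), gold.split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the Denver Broncos", "Denver Broncos"))  # 0.8
print(token_f1("Seattle", "Denver Broncos"))             # 0.0
```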
Simple and Effective Multi-Paragraph Reading Comprehension
TLDR
It is shown that it is possible to significantly improve performance by using a modified training scheme that teaches the model to ignore non-answer-containing paragraphs, which involves sampling multiple paragraphs from each document and using an objective function that requires the model to produce globally correct output.
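
One way to make the objective "globally correct" is shared normalization: answer-span scores from all sampled paragraphs of a document go through a single softmax, so paragraphs that do not contain the answer must score low relative to the one that does. A numerical sketch of that normalization with random scores standing in for model outputs (the general idea only, not the cited implementation):

```python
import numpy as np

rng = np.random.default_rng(0)

# Span scores for three paragraphs sampled from the same document; in the real model
# these come from the reading-comprehension network, here they are random stand-ins.
paragraph_span_scores = [rng.normal(size=5), rng.normal(size=4), rng.normal(size=6)]

# Shared normalization: a single softmax over every candidate span in every sampled
# paragraph, rather than a separate softmax per paragraph. Training then maximizes the
# total probability on spans matching the answer, pushing answer-free paragraphs down.
all_scores = np.concatenate(paragraph_span_scores)
probs = np.exp(all_scores - all_scores.max())
probs /= probs.sum()

print(probs.shape, round(probs.sum(), 6))  # (15,) 1.0
```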
Know What You Don’t Know: Unanswerable Questions for SQuAD
TLDR
SQuAD 2.0 is a new dataset that combines the existing Stanford Question Answering Dataset (SQuAD) with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.
Bidirectional Attention Flow for Machine Comprehension
TLDR
The BiDAF network is introduced, a multi-stage hierarchical process that represents the context at different levels of granularity and uses a bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization.
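
The attention-flow layer computes a token-by-token similarity matrix between context and query and reads two attentions off it: context-to-query and query-to-context. A compact numerical sketch of those two directions with random vectors and dot-product similarity (BiDAF itself uses a learned trilinear similarity function):

```python
import numpy as np

rng = np.random.default_rng(0)
T, J, d = 6, 4, 8                       # context length, query length, hidden size
H = rng.normal(size=(T, d))             # context token representations
U = rng.normal(size=(J, d))             # query token representations

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

S = H @ U.T                             # similarity matrix, shape (T, J)

# Context-to-query attention: each context token attends over the query tokens.
U_tilde = softmax(S, axis=1) @ U        # (T, d) query-aware context

# Query-to-context attention: weight context tokens by their best match to any query token.
b = softmax(S.max(axis=1))              # (T,)
h_tilde = b @ H                         # (d,) attended context vector, shared across positions

# Query-aware representation per context token, as in BiDAF's G = [H; U~; H*U~; H*h~].
G = np.concatenate([H, U_tilde, H * U_tilde, H * h_tilde], axis=1)
print(G.shape)                          # (6, 32)
```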
...