DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
@inproceedings{Dua2019DROPAR,
  title     = {DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs},
  author    = {Dheeru Dua and Yizhong Wang and Pradeep Dasigi and Gabriel Stanovsky and Sameer Singh and Matt Gardner},
  booktitle = {NAACL},
  year      = {2019}
}
Reading comprehension has recently seen rapid progress, with systems matching humans on the most popular datasets for the task. However, a large body of work has highlighted the brittleness of these systems, showing that there is much work left to be done. We introduce a new reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs. In this crowdsourced, adversarially-created, 55k-question benchmark, a system must resolve references in a question…
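As a toy illustration of the discrete reasoning DROP targets (not the paper's model or evaluation code), answering a question can require extracting numbers from a passage and applying a symbolic operation such as counting, summing, or subtracting; the passage and the `discrete_answer` helper below are hypothetical.

```python
import re

def extract_numbers(text: str) -> list[float]:
    """Pull all numeric tokens out of a passage (hypothetical helper)."""
    return [float(n) for n in re.findall(r"\d+(?:\.\d+)?", text)]

def discrete_answer(passage: str, operation: str) -> float:
    """Apply a simple discrete operation over numbers found in the passage.

    A sketch of the *kind* of reasoning DROP requires (count, sum,
    difference), not an actual DROP system.
    """
    numbers = extract_numbers(passage)
    if operation == "count":
        return float(len(numbers))
    if operation == "sum":
        return sum(numbers)
    if operation == "difference":
        return max(numbers) - min(numbers)
    raise ValueError(f"unsupported operation: {operation}")

# Hypothetical DROP-style passage and question:
passage = "The Broncos scored touchdowns of 12, 45, and 3 yards."
# "How many yards longer was the longest touchdown than the shortest?"
print(discrete_answer(passage, "difference"))  # 42.0
```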
352 Citations
On Making Reading Comprehension More Comprehensive
- EMNLP 2019
This work justifies a question answering approach to reading comprehension and describes the various kinds of questions one might use to more fully test a system’s comprehension of a passage, moving beyond questions that only probe local predicate-argument structures.
BiQuAD: Towards QA based on deeper text understanding
- STARSEM 2021
This work introduces a new dataset called BiQuAD that requires deeper comprehension in order to answer questions in both extractive and deductive fashion and shows that state-of-the-art QA models do not perform well on the challenging long form contexts and reasoning requirements posed by the dataset.
Comprehensive Multi-Dataset Evaluation of Reading Comprehension
- EMNLP 2019
An evaluation server, ORB, is presented, that reports performance on seven diverse reading comprehension datasets, encouraging and facilitating testing a single model’s capability in understanding a wide variety of reading phenomena.
ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning
- ICLR 2020
A new reading comprehension dataset requiring logical reasoning (ReClor), extracted from standardized graduate admission examinations, is introduced; biased data points are identified and separated into an EASY set, with the remainder forming a HARD set, in order to comprehensively evaluate the logical reasoning ability of models on ReClor.
MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics
- EMNLP 2020
A Learned Evaluation metric for Reading Comprehension, LERC, is trained to mimic human judgement scores, which achieves 80% accuracy and outperforms baselines by 14 to 26 absolute percentage points while leaving significant room for improvement.
Getting Closer to AI Complete Question Answering: A Set of Prerequisite Real Tasks
- AAAI 2020
QuAIL is presented, the first RC dataset to combine text-based, world-knowledge, and unanswerable questions, and to provide question-type annotation that enables diagnosis of the reasoning strategies used by a given QA system.
Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning
- EMNLP 2019
This work presents a new crowdsourced dataset containing more than 24K span-selection questions that require resolving coreference among entities in over 4.7K English paragraphs from Wikipedia, and shows that state-of-the-art reading comprehension models perform significantly worse than humans on this benchmark.
A Simple and Effective Model for Answering Multi-span Questions
- EMNLP 2020
This work proposes a new approach for tackling multi-span questions based on sequence tagging, which differs from previous approaches to answering span questions, and shows that it yields an absolute improvement and slightly eclipses the current state-of-the-art results on the entire DROP dataset.
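A minimal sketch of the sequence-tagging idea described in this entry, assuming a standard BIO scheme over passage tokens; the tokens, tags, and `decode_spans` helper are illustrative, not the authors' code.

```python
from typing import List

def decode_spans(tokens: List[str], tags: List[str]) -> List[str]:
    """Decode multi-span answers from BIO tags over passage tokens.

    A sketch of sequence-tagging-based multi-span extraction: every
    maximal B-I* run becomes one answer span.
    """
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":                 # a new span starts here
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I" and current:   # continue the open span
            current.append(token)
        else:                          # "O" (or a stray "I") closes any open span
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans

# Hypothetical example: two answer spans tagged in a passage.
tokens = ["Smith", "and", "Jones", "scored", "in", "the", "game"]
tags   = ["B",     "O",   "B",     "O",      "O",  "O",   "O"]
print(decode_spans(tokens, tags))  # ['Smith', 'Jones']
```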
A Multi-Type Multi-Span Network for Reading Comprehension that Requires Discrete Reasoning
- EMNLP-IJCNLP 2019
The Multi-Type Multi-Span Network (MTMSN) is introduced, a neural reading comprehension model that combines a multi-type answer predictor designed to support various answer types with a multi-span extraction method for dynamically producing one or multiple text spans.
IIRC: A Dataset of Incomplete Information Reading Comprehension Questions
- EMNLP 2020
A dataset of more than 13K questions over paragraphs from English Wikipedia that provide only partial information for answering them, with the missing information occurring in one or more linked documents; a baseline model achieves 31.1% F1 on this task, while estimated human performance is 88.4%.
References
Showing 1-10 of 57 references
Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences
- NAACL 2018
The dataset is the first to study multi-sentence inference at scale, with an open-ended set of question types that require reasoning skills; human solvers achieve an F1 score of 88.1%.
How Much Reading Does Reading Comprehension Require? A Critical Investigation of Popular Benchmarks
- EMNLP 2018
Sensible baselines are established for the bAbI, SQuAD, CBT, CNN, and Who-did-What datasets, finding that question- and passage-only models often perform surprisingly well.
The NarrativeQA Reading Comprehension Challenge
- TACL 2018
A new dataset and set of tasks in which the reader must answer questions about stories by reading entire books or movie scripts are presented, designed so that successfully answering their questions requires understanding the underlying narrative rather than relying on shallow pattern matching or salience.
A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task
- ACL 2016
A thorough examination of this reading comprehension task, in which over a million training examples are created by pairing CNN and Daily Mail news articles with their summarized bullet points, showing that a neural network can be trained to give good performance on it.
Constructing Datasets for Multi-hop Reading Comprehension Across Documents
- TACL 2018
A novel task to encourage the development of models for text understanding across multiple documents and to investigate the limits of existing methods, in which a model learns to seek and combine evidence, effectively performing multi-hop (multi-step) inference.
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
- ACL 2017
It is shown that, in comparison to other recently introduced large-scale datasets, TriviaQA has relatively complex, compositional questions, has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and requires more cross-sentence reasoning to find answers.
SQuAD: 100,000+ Questions for Machine Comprehension of Text
- EMNLP 2016
A strong logistic regression model is built, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%).
Simple and Effective Multi-Paragraph Reading Comprehension
- ACL 2018
It is shown that performance can be significantly improved by using a modified training scheme that teaches the model to ignore paragraphs that do not contain the answer, which involves sampling multiple paragraphs from each document and using an objective function that requires the model to produce globally correct output.
Know What You Don’t Know: Unanswerable Questions for SQuAD
- ACL 2018
SQuadRUn is a new dataset that combines the existing Stanford Question Answering Dataset (SQuAD) with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.
Bidirectional Attention Flow for Machine Comprehension
- ICLR 2017
The BiDAF network is introduced, a multi-stage hierarchical process that represents the context at different levels of granularity and uses a bi-directional attention flow mechanism to obtain a query-aware context representation without early summarization.
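A rough numerical sketch of the bi-directional attention computation (context-to-query and query-to-context attention over a similarity matrix), assuming a plain dot-product similarity in place of BiDAF's trainable similarity function and made-up random embeddings; this is not the authors' implementation.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_attention(context: np.ndarray, query: np.ndarray) -> np.ndarray:
    """Sketch of bi-directional attention flow (not the exact BiDAF layer).

    context: (T, d) contextual embeddings of the passage
    query:   (J, d) contextual embeddings of the question
    Returns a query-aware context representation of shape (T, 4d).
    """
    # Similarity matrix S[t, j]; BiDAF uses a trainable function of
    # (h, u, h*u), while a simple dot product is assumed here.
    S = context @ query.T                      # (T, J)

    # Context-to-query: for each context position, attend over query words.
    a = softmax(S, axis=1)                     # (T, J)
    c2q = a @ query                            # (T, d)

    # Query-to-context: attend over context positions via max similarity per row.
    b = softmax(S.max(axis=1), axis=0)         # (T,)
    q2c = (b[:, None] * context).sum(axis=0)   # (d,)
    q2c = np.tile(q2c, (context.shape[0], 1))  # broadcast to (T, d)

    # Combine, roughly as in BiDAF's G = [h; c2q; h*c2q; h*q2c].
    return np.concatenate([context, c2q, context * c2q, context * q2c], axis=1)

rng = np.random.default_rng(0)
G = bidirectional_attention(rng.normal(size=(5, 8)), rng.normal(size=(3, 8)))
print(G.shape)  # (5, 32)
```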