Comprehensive Multi-Dataset Evaluation of Reading Comprehension

Dheeru Dua, Ananth Gottumukkala, Alon Talmor, Matt Gardner, Sameer Singh. Conference on Empirical Methods in Natural Language Processing.
Reading comprehension is a crucial task for furthering research in natural language understanding. Many diverse reading comprehension datasets have recently been introduced to study various phenomena in natural language, ranging from simple paraphrase matching and entity typing to entity tracking and understanding the implications of the context. Given the availability of so many such datasets, comprehensive and reliable evaluation is tedious and time-consuming for researchers…
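Most of the datasets listed below score systems with a token-overlap F1 in the style of the SQuAD evaluation script. A minimal sketch of that metric (simplified: whitespace tokenization only, no punctuation or article stripping; the function names are illustrative, not from the paper):

```python
from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """SQuAD-style token-overlap F1 between a predicted and a gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # multiset intersection
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def best_f1(prediction: str, golds: list[str]) -> float:
    """Score against the best-matching of several gold answers."""
    return max(token_f1(prediction, g) for g in golds)
```

The official scripts additionally normalize punctuation and articles and report exact match alongside F1; this sketch only shows the core overlap computation.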


UnifiedQA: Crossing Format Boundaries With a Single QA System

This work uses the latest advances in language modeling to build a single pre-trained QA model, UNIFIEDQA, that performs well across 19 QA datasets spanning 4 diverse formats, and results in a new state of the art on 10 factoid and commonsense question answering datasets.

Beyond Leaderboards: A survey of methods for revealing weaknesses in Natural Language Inference data and models

This structured survey provides an overview of the evolving research area by categorising reported weaknesses in models and datasets and the methods proposed to reveal and alleviate those weaknesses for the English language.

Towards Human-Centred Explainability Benchmarks For Text Classification

This position paper proposes to extend text classification benchmarks to evaluate the explainability of text classifiers, and to ground these benchmarks in human-centred applications, for example by using social media, by gamification, or by learning explainability metrics from human judgements.

DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

A new reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs, and presents a new model that combines reading comprehension methods with simple numerical reasoning to achieve 51% F1.

MultiQA: An Empirical Investigation of Generalization and Transfer in Reading Comprehension

It is shown that training on a source RC dataset and transferring to a target dataset substantially improves performance, even in the presence of powerful contextual representations from BERT (Devlin et al., 2019).

The NarrativeQA Reading Comprehension Challenge

A new dataset and set of tasks are presented in which the reader must answer questions about stories by reading entire books or movie scripts, designed so that successfully answering the questions requires understanding the underlying narrative rather than relying on shallow pattern matching or salience.

DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension

DuoRC is proposed, a novel dataset for Reading Comprehension (RC) that poses several new challenges for neural approaches to language understanding beyond those offered by existing RC datasets, and that could complement those datasets in exploring novel neural approaches.

SQuAD: 100,000+ Questions for Machine Comprehension of Text

A strong logistic regression model is built, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%).

Learning to Ask Unanswerable Questions for Machine Reading Comprehension

A pair-to-sequence model for unanswerable question generation is presented, which effectively captures the interactions between the question and the paragraph, along with a way to construct training data for question generation models by leveraging existing reading comprehension datasets.

Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning

This work presents a new crowdsourced dataset containing more than 24K span-selection questions that require resolving coreference among entities in over 4.7K English paragraphs from Wikipedia, and shows that state-of-the-art reading comprehension models perform significantly worse than humans on this benchmark.

Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks

This work argues for the usefulness of a set of proxy tasks that evaluate reading comprehension via question answering, and classify these tasks into skill sets so that researchers can identify (and then rectify) the failings of their systems.

Adversarial Examples for Evaluating Reading Comprehension Systems

This work proposes an adversarial evaluation scheme for the Stanford Question Answering Dataset that tests whether systems can answer questions about paragraphs that contain adversarially inserted sentences without changing the correct answer or misleading humans.
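To make the scheme concrete, here is a toy sketch (the "model", the question, and the distractor sentence are all invented for illustration, not taken from the paper): a shallow word-overlap heuristic answers correctly on the clean passage but is fooled by an appended distractor sentence that would not mislead a human reader.

```python
import re

def overlap_model(paragraph: str, question: str, candidates: list[str]) -> str:
    """Toy reader: pick the candidate whose sentence shares the most words
    with the question (a stand-in for shallow pattern-matching systems)."""
    q_words = set(re.findall(r"\w+", question.lower()))
    best, best_score = None, -1
    for sent in paragraph.split("."):
        score = len(q_words & set(re.findall(r"\w+", sent.lower())))
        for cand in candidates:
            if cand in sent and score > best_score:
                best, best_score = cand, score
    return best

clean = "Nikola Tesla moved to Paris in 1880."
question = "What city did Tesla move to in 1880?"
# Adversarially inserted sentence: high lexical overlap with the question,
# but a human can see it does not answer it, and the correct answer is unchanged.
adv = clean + " Isaac Newton did move to the city of Chicago in 1881."

print(overlap_model(clean, question, ["Paris", "Chicago"]))  # Paris
print(overlap_model(adv, question, ["Paris", "Chicago"]))    # Chicago (fooled)
```

The gap between clean and adversarial accuracy is the quantity this style of evaluation measures.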

Reasoning Over Paragraph Effects in Situations

This work presents ROPES, a challenging benchmark for reading comprehension targeting Reasoning Over Paragraph Effects in Situations, which focuses on expository language describing causes and effects, as these have clear implications for new situations.