QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension

@article{rogers-qa-dataset-explosion,
  title={QA Dataset Explosion: A Taxonomy of NLP Resources for Question Answering and Reading Comprehension},
  author={Anna Rogers and Matt Gardner and Isabelle Augenstein},
  journal={ACM Computing Surveys (CSUR)},
}
Alongside huge volumes of research on deep learning models in NLP in recent years, there has also been much work on the benchmark datasets needed to track modeling progress. Question answering and reading comprehension have been particularly prolific in this regard, with over 80 new datasets appearing in the past two years. This study is the largest survey of the field to date. We provide an overview of the various formats and domains of the current resources, highlighting the current lacunae…


CLICKER: A Computational LInguistics Classification Scheme for Educational Resources

This work proposes CLICKER, a classification scheme for CL/NLP based on the analysis of online lectures from 77 university courses on the subject, and discusses how such a taxonomy can help in various real-world applications, including tutoring platforms, resource retrieval, resource recommendation, prerequisite chain learning, and survey generation.

BehanceQA: A New Dataset for Identifying Question-Answer Pairs in Video Transcripts

A large-scale QA identification dataset, annotated by humans over transcripts of 500 hours of streamed videos, is presented; experiments show that the annotated dataset presents unique challenges for existing methods and that more research is necessary to explore more effective methods.

English Machine Reading Comprehension Datasets: A Survey

This paper surveys 54 English Machine Reading Comprehension datasets and reveals that Wikipedia is by far the most common data source and that there is a relative lack of why, when, and where questions across datasets.

Beyond Counting Datasets: A Survey of Multilingual Dataset Construction and Necessary Resources

After quantifying the qualitative NLP resource gap across languages, this survey discusses how to improve data collection in low-resource languages and makes macro- and micro-level suggestions to the NLP community and to individual researchers for future multilingual data development.

WikiOmnia: filtration and evaluation of the generated QA corpus on the whole Russian Wikipedia

The WikiOmnia dataset is presented, a new publicly available set of QA pairs and corresponding Russian Wikipedia article summary sections, composed with a fully automated generation and distribution pipeline.

The Legal Argument Reasoning Task in Civil Procedure

A new NLP task and dataset from the domain of U.S. civil procedure is presented, consisting of a general introduction to the case, a particular question, and a possible solution argument, followed by a detailed analysis of why the argument applies in that case.

Exploring the Utility of Dutch Question Answering Datasets for Human Resource Contact Centres

A Dutch HR QA dataset with over 300 questions in the format of the SQuAD 2.0 dataset, which distinguishes between answerable and unanswerable questions, is created, and various BERT-based models are applied to it.

EduQG: A Multi-format Multiple Choice Dataset for the Educational Domain

A high-quality dataset is presented that contains 3,397 samples comprising multiple-choice questions, answers (including distractors), and their source documents from the educational domain; it can be used for both question and distractor generation, as well as to explore new challenges such as question-format conversion.

Competence-based Question Generation

This work defines competence-based (CB) question generation, and focuses on queries over lexical semantic knowledge involving implicit argument and subevent structure of verbs.

Machine Reading, Fast and Slow: When Do Models “Understand” Language?

It is found that for comparison (but not coreference) the systems based on larger encoders are more likely to rely on the "right" information, but even they struggle with generalization, suggesting that they still learn specific lexical patterns rather than the general principles of comparison.

Getting Closer to AI Complete Question Answering: A Set of Prerequisite Real Tasks

QuAIL is presented, the first RC dataset to combine text-based, world-knowledge, and unanswerable questions, and to provide question-type annotation that enables diagnostics of the reasoning strategies used by a given QA system.

TyDi QA: A Benchmark for Information-Seeking Question Answering in Typologically Diverse Languages

A quantitative analysis of the data quality and example-level qualitative linguistic analyses of observed language phenomena that would not be found in English-only corpora are presented.

ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning

A new Reading Comprehension dataset requiring logical reasoning (ReClor), extracted from standardized graduate admission examinations, is introduced, and it is proposed to identify biased data points and separate them into an EASY set, with the rest forming a HARD set, in order to comprehensively evaluate the logical reasoning ability of models on ReClor.

SberQuAD - Russian Reading Comprehension Dataset: Description and Analysis

SberQuAD, a large-scale analog of Stanford SQuAD for the Russian language, is a valuable resource that has not been properly presented to the scientific community; this gap is filled by providing a description and analysis of the dataset.

Introducing MANtIS: a novel Multi-Domain Information Seeking Dialogues Dataset

Conversational search is an approach to information retrieval (IR) in which users engage in a dialogue with an agent in order to satisfy their information needs. Previous conceptual work described…

MLQA: Evaluating Cross-lingual Extractive Question Answering

This work presents MLQA, a multi-way aligned extractive QA evaluation benchmark intended to spur research in this area, and evaluates state-of-the-art cross-lingual models and machine-translation-based baselines on MLQA.

XQA: A Cross-lingual Open-domain Question Answering Dataset

A novel dataset XQA is constructed that consists of a training set in English as well as development and test sets in eight other languages and provides several baseline systems for cross-lingual OpenQA, showing that the multilingual BERT model achieves the best results in almost all target languages.

ELI5: Long Form Question Answering

This work introduces the first large-scale corpus for long form question answering, a task requiring elaborate and in-depth answers to open-ended questions, and shows that an abstractive model trained with a multi-task objective outperforms conventional Seq2Seq, language modeling, as well as a strong extractive baseline.

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

It is found that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that, surprisingly, this transfer continues to be very beneficial even when starting from massive pre-trained language models such as BERT.