Natural Questions: A Benchmark for Question Answering Research

  title={Natural Questions: A Benchmark for Question Answering Research},
  author={Tom Kwiatkowski and Jennimaria Palomaki and Olivia Redfield and Michael Collins and Ankur P. Parikh and Chris Alberti and Danielle Epstein and Illia Polosukhin and Jacob Devlin and Kenton Lee and Kristina Toutanova and Llion Jones and Matthew Kelcey and Ming-Wei Chang and Andrew M. Dai and Jakob Uszkoreit and Quoc V. Le and Slav Petrov},
  journal={Transactions of the Association for Computational Linguistics},
We present the Natural Questions corpus, a question answering data set. [...] We also describe analysis of 25-way annotations on 302 examples, giving insights into human variability on the annotation task. We introduce robust metrics for the purposes of evaluating question answering systems; demonstrate high human upper bounds on these metrics; and establish baseline results using competitive methods drawn from related literature.

QAMPARI: An Open-domain Question Answering Benchmark for Questions with Many Answers from Multiple Paragraphs

QA models from the retrieve-and-read family are trained, showing that QAMPARI is challenging in terms of both passage retrieval and answer generation, reaching an F1 score of 26.6 at best.

Challenges in Information-Seeking QA: Unanswerable Questions and Paragraph Retrieval

This study manually annotates 800 unanswerable examples across six languages on what makes them challenging to answer and conducts per-category answerability prediction, revealing issues in the current dataset collection as well as task formulation.

Towards Universal Dense Retrieval for Open-domain Question Answering

This paper introduces an entity-rich question answering dataset constructed from Wikidata facts and demonstrates dense models are unable to generalize to unseen input question distributions, and encourages the field to further investigate the creation of a single, universal dense retrieval model that generalizes well across all input distributions.

A Dataset of Information-Seeking Questions and Answers Anchored in Research Papers

Qasper is presented, a dataset of 5049 questions over 1585 Natural Language Processing papers that is designed to facilitate document-grounded, information-seeking QA, and finds that existing models that do well on other QA tasks do not perform well on answering these questions.

AmazonQA: A Review-Based Question Answering Task

A new dataset is introduced and a method that combines information retrieval techniques for selecting relevant reviews and "reading comprehension" models for synthesizing an answer is proposed for review-based QA, demonstrating the challenging nature of this new task.

QuesBELM: A BERT based Ensemble Language Model for Natural Questions

This work systematically compares the performance of powerful Transformer variants (BERT-base, BERT-large-WWM, and ALBERT-XXL) on the Natural Questions dataset and proposes a state-of-the-art BERT-based ensemble language model, QuesBELM.

FeTaQA: Free-form Table Question Answering

This work introduces FeTaQA, a new dataset with 10K Wikipedia-based (table, question, free-form answer, supporting table cells) pairs, and provides two benchmark methods for the proposed task: a pipeline method based on semantic parsing-based QA systems and an end-to-end method based on large pretrained text generation models, and shows that FeTaQA poses a challenge for both methods.

Revisiting the Open-Domain Question Answering Pipeline

Mindstone is described, an open-domain QA system that consists of a new multi-stage pipeline that employs a traditional BM25-based information retriever, RM3-based neural relevance feedback, neural ranker, and a machine reading comprehension stage.

How Do We Answer Complex Questions: Discourse Structure of Long-form Answers

An ontology of six sentence-level functional roles for long-form answers is developed, finding that annotators agree less with each other when annotating model-generated answers compared to annotating human-written answers.

RikiNet: Reading Wikipedia Pages for Natural Question Answering

This paper introduces RikiNet, a new model that reads Wikipedia pages for natural question answering and is the first single model to outperform single human performance.



QuAC: Question Answering in Context

QuAC introduces challenges not found in existing machine comprehension datasets: its questions are often more open-ended, unanswerable, or only meaningful within the dialog context, as shown in a detailed qualitative evaluation.

WikiQA: A Challenge Dataset for Open-Domain Question Answering

The WIKIQA dataset is described, a new publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering, which is more than an order of magnitude larger than the previous dataset.

CoQA: A Conversational Question Answering Challenge

CoQA is introduced, a novel dataset for building Conversational Question Answering systems and it is shown that conversational questions have challenging phenomena not present in existing reading comprehension datasets (e.g., coreference and pragmatic reasoning).

SQuAD: 100,000+ Questions for Machine Comprehension of Text

A strong logistic regression model is built, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%).

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

This new dataset is aimed to overcome a number of well-known weaknesses of previous publicly available datasets for the same task of reading comprehension and question answering, and is the most comprehensive real-world dataset of its kind in both quantity and quality.

Know What You Don’t Know: Unanswerable Questions for SQuAD

SQuADRUn is a new dataset that combines the existing Stanford Question Answering Dataset (SQuAD) with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.

TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

It is shown that, in comparison to other recently introduced large-scale datasets, TriviaQA has relatively complex, compositional questions, has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and requires more cross sentence reasoning to find answers.

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

It is shown that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.

DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications

This paper introduces DuReader, a new large-scale, open-domain Chinese machine reading comprehension (MRC) dataset, designed to address real-world MRC, and organizes a shared competition to encourage the exploration of more models.

Reading Wikipedia to Answer Open-Domain Questions

This approach combines a search component based on bigram hashing and TF-IDF matching with a multi-layer recurrent neural network model trained to detect answers in Wikipedia paragraphs, indicating that both modules are highly competitive with respect to existing counterparts.
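
As a rough illustration of the retrieval idea in the last entry (hashed n-gram features scored with TF-IDF), the following is a minimal sketch and not the authors' implementation. The bucket count, the CRC32 hash, the whitespace tokenizer, the smoothed IDF formula, and all function names are assumptions chosen to keep the example small and self-contained.

```python
import math
import zlib
from collections import Counter

NUM_BUCKETS = 2 ** 16  # hypothetical bucket count, kept small for the example


def hashed_ngrams(text):
    """Map the unigrams and bigrams of a lowercased text to hash buckets."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    grams = tokens + [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    # zlib.crc32 is deterministic across runs, unlike the built-in str hash.
    return [zlib.crc32(g.encode("utf-8")) % NUM_BUCKETS for g in grams]


def build_index(docs):
    """Return sparse TF-IDF vectors (one dict per document) and the IDF table."""
    counts = [Counter(hashed_ngrams(d)) for d in docs]
    doc_freq = Counter(bucket for c in counts for bucket in c)
    n = len(docs)
    # Smoothed IDF so every bucket seen in the corpus gets a positive weight.
    idf = {b: math.log((n + 1) / (df + 1)) + 1.0 for b, df in doc_freq.items()}
    vectors = [{b: math.log1p(tf) * idf[b] for b, tf in c.items()} for c in counts]
    return vectors, idf


def rank(question, vectors, idf):
    """Return document indices sorted by descending dot-product score."""
    q_counts = Counter(hashed_ngrams(question))
    # Buckets never seen in the corpus cannot match any document, so drop them.
    q_vec = {b: math.log1p(tf) * idf[b] for b, tf in q_counts.items() if b in idf}
    scores = [sum(w * v.get(b, 0.0) for b, w in q_vec.items()) for v in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)


docs = [
    "Paris is the capital of France",
    "Berlin is the capital of Germany",
    "The Nile is a river in Africa",
]
vectors, idf = build_index(docs)
ranking = rank("capital of France", vectors, idf)
```

In a retrieve-and-read pipeline, the top-ranked documents from a step like this would be passed to a separate reader model that extracts the answer span; that reader is outside the scope of this sketch.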