Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies

@article{geva2021strategyqa,
  title={Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies},
  author={Mor Geva and Daniel Khashabi and Elad Segal and Tushar Khot and Dan Roth and Jonathan Berant},
  journal={Transactions of the Association for Computational Linguistics},
  year={2021}
}
Abstract: A key limitation of current datasets for multi-hop reasoning is that the steps required to answer a question are mentioned in it explicitly. In this work, we introduce StrategyQA, a question answering (QA) benchmark in which the required reasoning steps are implicit in the question and must be inferred using a strategy. A fundamental challenge in this setup is how to elicit such creative questions from crowd workers while covering a broad range of potential strategies…

Which Linguist Invented the Lightbulb? Presupposition Verification for Question-Answering

It is shown that a simple modification of adding presuppositions and their verifiability to the input of a competitive end-to-end QA system yields modest gains in QA performance and unanswerability detection, demonstrating the promise of the approach.

TellMeWhy: A Dataset for Answering Why-Questions in Narratives

This work introduces TellMeWhy, a new crowd-sourced dataset that consists of more than 30k questions and free-form answers concerning why characters in short narratives perform the actions described, and shows that state-of-the-art models are far below human performance on answering such questions.

How much coffee was consumed during EMNLP 2019? Fermi Problems: A New Reasoning Challenge for AI

Several unsolved AI problems are crystallized into a single, new challenge, namely Fermi Problems (FPs), which are questions whose answers can only be approximately estimated because their precise computation is either impractical or impossible.

Teaching Broad Reasoning Skills via Decomposition-Guided Contexts

This work shows how one can effectively use decomposition-guided contexts to robustly teach multihop reasoning, and substantially improves model performance and robustness even when starting with numeracy-aware LMs pretrained using recent methods.

Explaining Answers with Entailment Trees

ENTAILMENTBANK is created, the first dataset to contain multistep entailment trees, providing a new type of dataset along with baselines and offering a new avenue for the community to generate richer, more systematic explanations.

Inferring Implicit Relations with Language Models

This work investigates why current models struggle with implicit reasoning question answering (QA) tasks, by decoupling inference of reasoning steps from their execution, and suggests that the bottleneck for answering implicit reasoning questions is in the ability of language models to retrieve and reason over information rather than to plan an accurate reasoning strategy.

ExplaGraphs: An Explanation Graph Generation Task for Structured Commonsense Reasoning

This work presents ExplaGraphs, a new generative and structured commonsense-reasoning task (and an associated dataset) of explanation graph generation for stance prediction, and proposes a multi-level evaluation framework that checks the structural and semantic correctness of the generated graphs and their degree of match with ground-truth graphs.

The Unreliability of Explanations in Few-Shot In-Context Learning

A framework for calibrating model predictions based on the reliability of explanations is presented and it is shown that explanations judged as good by humans—those that are logically consistent with the input and the prediction—usually indicate more accurate predictions.

CREAK: A Dataset for Commonsense Reasoning over Entity Knowledge

This work introduces CREAK, a testbed for commonsense reasoning about entity knowledge, bridging fact-checking about entities (Harry Potter is a wizard and is skilled at riding a broomstick) with commonsense inferences (if you’re good at a skill you can teach others how to do it).

♫ MuSiQue: Multihop Questions via Single-hop Question Composition

A bottom–up approach is introduced that systematically selects composable pairs of single-hop questions that are connected, that is, where one reasoning step critically relies on information from another, to create a new multihop QA dataset with 25K 2–4 hop questions.

Avoiding Reasoning Shortcuts: Adversarial Evaluation, Training, and Model Development for Multi-Hop QA

This paper shows that examples in the multi-hop HotpotQA dataset often contain reasoning shortcuts through which models can directly locate the answer by word-matching the question with a sentence in the context, and shows that a proposed 2-hop model trained on the regular data is more robust to the adversarial examples than the baseline.

DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

A new reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs, and presents a new model that combines reading comprehension methods with simple numerical reasoning to achieve 51% F1.

QASC: A Dataset for Question Answering via Sentence Composition

This work presents a multi-hop reasoning dataset, Question Answering via Sentence Composition (QASC), that requires retrieving facts from a large corpus and composing them to answer a multiple-choice question, and provides annotation for supporting facts as well as their composition.

Unsupervised Question Decomposition for Question Answering

An algorithm for One-to-N Unsupervised Sequence transduction (ONUS) that learns to map one hard, multi-hop question to many simpler, single-hop sub-questions, which is promising for shedding light on why a QA system makes a prediction.

The Web as a Knowledge-Base for Answering Complex Questions

This paper proposes to decompose complex questions into a sequence of simple questions, and compute the final answer from the sequence of answers, and empirically demonstrates that question decomposition improves performance from 20.8 to 27.5 precision@1 on this new dataset.

Natural Questions: A Benchmark for Question Answering Research

The Natural Questions corpus, a question answering data set, is presented, introducing robust metrics for the purposes of evaluating question answering systems; demonstrating high human upper bounds on these metrics; and establishing baseline results using competitive methods drawn from related literature.

Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences

The dataset is the first to study multi-sentence inference at scale, with an open-ended set of question types that requires reasoning skills, and finds human solvers to achieve an F1-score of 88.1%.

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

It is shown that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.

Multi-hop Reading Comprehension through Question Decomposition and Rescoring

A system that decomposes a compositional question into simpler sub-questions that can be answered by off-the-shelf single-hop RC models is proposed and a new global rescoring approach is introduced that considers each decomposition to select the best final answer, greatly improving overall performance.

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

It is found that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that, surprisingly, this transfer continues to be very beneficial even when starting from massive pre-trained language models such as BERT.