Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies

@article{Geva2021DidAU,
  title={Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies},
  author={Mor Geva and Daniel Khashabi and Elad Segal and Tushar Khot and Dan Roth and Jonathan Berant},
  journal={Transactions of the Association for Computational Linguistics},
  year={2021},
  volume={9},
  pages={346-361}
}
Abstract

A key limitation in current datasets for multi-hop reasoning is that the required steps for answering the question are mentioned in it explicitly. In this work, we introduce StrategyQA, a question answering (QA) benchmark where the required reasoning steps are implicit in the question, and should be inferred using a strategy. A fundamental challenge in this setup is how to elicit such creative questions from crowdsourcing workers, while covering a broad range of potential strategies…
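To make the idea of an implicit strategy concrete, below is a minimal Python sketch of the kind of decomposition the benchmark targets, using the paper's title question. The record layout (question, decomposition, answer fields) is illustrative only and is not claimed to match the released dataset's exact schema.

    # Illustrative sketch of a StrategyQA-style example and its implicit decomposition.
    # The field names are hypothetical, chosen for readability; they are not
    # necessarily the dataset's exact schema.
    example = {
        "question": "Did Aristotle use a laptop?",
        # None of these steps appear in the question itself; a solver has to infer
        # the comparison strategy (Aristotle's lifetime vs. the laptop's invention).
        "decomposition": [
            "When did Aristotle live?",
            "When was the laptop invented?",
            "Is #2 before #1?",
        ],
        "answer": False,  # boolean yes/no answer
    }

    def show(ex):
        """Print a question, its inferred reasoning steps, and the yes/no answer."""
        print(ex["question"])
        for i, step in enumerate(ex["decomposition"], start=1):
            print(f"  step {i}: {step}")
        print("  answer:", "yes" if ex["answer"] else "no")

    show(example)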
Citations

Which Linguist Invented the Lightbulb? Presupposition Verification for Question-Answering
TLDR
This work finds that adding presuppositions and their verifiability to an existing model yields modest gains in downstream performance and unanswerability detection, and it presents a preliminary approach to integrating these steps into an existing QA system.
TellMeWhy: A Dataset for Answering Why-Questions in Narratives
TLDR
This work introduces TellMeWhy, a new crowd-sourced dataset that consists of more than 30k questions and free-form answers concerning why characters in short narratives perform the actions described, and shows that state-of-the-art models are far below human performance on answering such questions.
ExplaGraphs: An Explanation Graph Generation Task for Structured Commonsense Reasoning
TLDR
This work presents EXPLAGRAPHS, a new generative and structured commonsense-reasoning task (and an associated dataset) of explanation graph generation for stance prediction, and proposes a multi-level evaluation framework that checks for the structural and semantic correctness of the generated graphs and their degree of match with ground-truth graphs.
Explaining Answers with Entailment Trees
TLDR
This work creates ENTAILMENTBANK, the first dataset to contain multistep entailment trees, providing a new type of dataset and baselines, and offering a new avenue for the community to generate richer, more systematic explanations.
Exploiting Reasoning Chains for Multi-hop Science Question Answering
TLDR
The proposed Chain Guided Retriever-reader framework allows the retriever to capture step-by-step clues about the entire reasoning process; it is shown to be effective on two challenging multi-hop science QA tasks, OpenBookQA and ARC-Challenge, and it also favors explainability.
CREAK: A Dataset for Commonsense Reasoning over Entity Knowledge
TLDR
This work introduces CREAK, a testbed for commonsense reasoning about entity knowledge, bridging fact-checking about entities (Harry Potter is a wizard and is skilled at riding a broomstick) with commonsense inferences (if you're good at a skill you can teach others how to do it).
CommonsenseQA 2.0: Exposing the Limits of AI through Gamification
Constructing benchmarks that test the abilities of modern natural language understanding models is difficult – pre-trained language models exploit artifacts in benchmarks to achieve human parity, …
MuSiQue: Multi-hop Questions via Single-hop Question Composition
TLDR
This work proposes a bottom-up, semi-automatic process of constructing multi-hop questions via composition of single-hop questions, and uses this process to construct a new multi-hop QA dataset, MuSiQue-Ans, which is challenging for state-of-the-art QA models.
Learning to Solve Complex Tasks by Talking to Agents
TLDR
This work proposes a new benchmark called COMMAQA that contains three kinds of complex reasoning tasks designed to be solved by “talking” to four agents with different capabilities, and is intended to enable the development of “green” AI systems that build upon existing agents.
On the Diversity and Limits of Human Explanations
Chenhao Tan. arXiv, 2021.
TLDR
Inspired by prior work in psychology and the cognitive sciences, this work groups existing human explanations in NLP into three categories: proximal mechanism, evidence, and procedure, which differ in nature and have implications for the resultant explanations.

References

Showing 1-10 of 34 references.
Avoiding Reasoning Shortcuts: Adversarial Evaluation, Training, and Model Development for Multi-Hop QA
TLDR
This paper shows that in the multi-hop HotpotQA dataset, examples often contain reasoning shortcuts through which models can directly locate the answer by word-matching the question with a sentence in the context, and that a 2-hop model trained on the regular data is more robust to the adversaries than the baseline.
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
TLDR
This work introduces DROP, a new reading comprehension benchmark which requires Discrete Reasoning Over the content of Paragraphs, and presents a new model that combines reading comprehension methods with simple numerical reasoning to achieve 51% F1.
QASC: A Dataset for Question Answering via Sentence Composition
TLDR
This work presents a multi-hop reasoning dataset, Question Answering via Sentence Composition (QASC), that requires retrieving facts from a large corpus and composing them to answer a multiple-choice question, and proposes a two-step approach to mitigate the retrieval challenges.
Break It Down: A Question Understanding Benchmark
TLDR
This work introduces a Question Decomposition Meaning Representation (QDMR) for questions and demonstrates its utility by showing that QDMR can be used to improve open-domain question answering on the HotpotQA dataset and can be deterministically converted to a pseudo-SQL formal language, which can alleviate annotation in semantic parsing applications.
The Web as a Knowledge-Base for Answering Complex Questions
TLDR
This paper proposes to decompose complex questions into a sequence of simple questions and compute the final answer from the sequence of answers, and it empirically demonstrates that question decomposition improves performance from 20.8 to 27.5 precision@1 on a new dataset.
Natural Questions: A Benchmark for Question Answering Research
TLDR
This work presents the Natural Questions corpus, a question answering dataset; introduces robust metrics for evaluating question answering systems; demonstrates high human upper bounds on these metrics; and establishes baseline results using competitive methods drawn from related literature.
Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences
TLDR
This dataset is the first to study multi-sentence inference at scale, with an open-ended set of question types that require reasoning skills; human solvers achieve an F1-score of 88.1%.
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
TLDR
It is shown that HotpotQA is challenging for the latest QA systems, and that the supporting facts enable models to improve performance and make explainable predictions.
Multi-hop Reading Comprehension through Question Decomposition and Rescoring
TLDR
This work proposes a system that decomposes a compositional question into simpler sub-questions answerable by off-the-shelf single-hop RC models, and introduces a new global rescoring approach that considers each decomposition to select the best final answer, greatly improving overall performance.
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
TLDR
This work finds that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that, surprisingly, it continues to be very beneficial even when starting from massive pre-trained language models such as BERT.