Measuring and Reducing Non-Multifact Reasoning in Multi-hop Question Answering

H. Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal
The measurement of true progress in multi-hop question answering has been muddled by the strong ability of models to exploit artifacts and other reasoning shortcuts. Models can produce the correct answer, and even independently identify the supporting facts, without necessarily connecting the information between those facts. This defeats the purpose of building multi-hop QA datasets. We make three contributions towards addressing this issue. First, we formalize this form of disconnected reasoning…

Related Papers

A Survey on Explainability in Machine Reading Comprehension
This paper provides a systematic review of benchmarks and approaches for explainability in Machine Reading Comprehension (MRC), along with evaluation methodologies for assessing the performance of explainable systems.


Compositional Questions Do Not Necessitate Multi-hop Reasoning
This work introduces a single-hop BERT-based RC model that achieves 67 F1, comparable to state-of-the-art multi-hop models, and designs an evaluation setting where humans are not shown all of the paragraphs necessary for the intended multi-hop reasoning but can still answer over 80% of the questions.
Avoiding Reasoning Shortcuts: Adversarial Evaluation, Training, and Model Development for Multi-Hop QA
This paper shows that examples in the multi-hop HotpotQA dataset often contain reasoning shortcuts through which models can directly locate the answer by word-matching the question with a sentence in the context, and that a 2-hop model trained on the regular data is more robust to the adversaries than the baseline.
Understanding Dataset Design Choices for Multi-hop Reasoning
This paper investigates two recently proposed datasets, WikiHop and HotpotQA, and explores sentence-factored models for these tasks; by design, these models cannot do multi-hop reasoning, but they are still able to solve a large number of examples in both datasets.
QASC: A Dataset for Question Answering via Sentence Composition
This work presents a multi-hop reasoning dataset, Question Answering via Sentence Composition (QASC), that requires retrieving facts from a large corpus and composing them to answer a multiple-choice question, and presents a two-step approach to mitigate the retrieval challenges.
Dynamically Fused Graph Network for Multi-hop Reasoning
Dynamically Fused Graph Network (DFGN) is proposed, a novel method for answering questions that require gathering multiple scattered pieces of evidence and reasoning over them, inspired by humans' step-by-step reasoning behavior.
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
It is shown that HotpotQA is challenging for the latest QA systems, and that the supporting facts enable models to improve performance and make explainable predictions.
A Simple Yet Strong Pipeline for HotpotQA
This paper presents a simple pipeline based on BERT that outperforms large-scale language models on both question answering and support identification on HotpotQA (and achieves performance very close to a RoBERTa model).
Answering while Summarizing: Multi-task Learning for Multi-hop QA with Evidence Extraction
This study proposes the Query Focused Extractor (QFE) model for evidence extraction, trained jointly with the QA model via multi-task learning; inspired by extractive summarization models, and unlike existing methods, it sequentially extracts evidence sentences using an RNN with an attention mechanism on the question sentence.
Select, Answer and Explain: Interpretable Multi-hop Reading Comprehension over Multiple Documents
This paper proposes an effective and interpretable Select, Answer and Explain (SAE) system to solve the multi-document RC problem, achieving top competitive performance in the distractor setting compared to other existing systems on the leaderboard.
Assessing the Benchmarking Capacity of Machine Reading Comprehension Datasets
Results suggest that most of the questions the model already answers correctly do not necessarily require grammatical or complex reasoning; therefore, MRC datasets will need to take extra care in their design to ensure that questions can correctly evaluate the intended skills.