Is Multihop QA in DiRe Condition? Measuring and Reducing Disconnected Reasoning

@inproceedings{Trivedi2020IsMQ,
  title={Is Multihop QA in DiRe Condition? Measuring and Reducing Disconnected Reasoning},
  author={H. Trivedi and Niranjan Balasubramanian and Tushar Khot and Ashish Sabharwal},
  booktitle={EMNLP},
  year={2020}
}
Has there been real progress in multi-hop question-answering? Models often exploit dataset artifacts to produce correct answers, without connecting information across multiple supporting facts. This limits our ability to measure true progress and defeats the purpose of building multihop QA datasets. We make three contributions towards addressing this. First, we formalize such undesirable behavior as disconnected reasoning across subsets of supporting facts. This allows developing a model… 
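The abstract formalizes disconnected reasoning over subsets of supporting facts. As a rough illustration only (not the authors' released probe), the Python sketch below checks whether a model can answer correctly from each half of a split of the gold supporting facts in isolation; the model.predict(question, context) interface and exact-match scoring are assumptions.

# Illustrative sketch only, not the paper's actual probe: a model that answers
# correctly from each part of a partition of the supporting facts, in isolation,
# never needed to connect them -- a symptom of disconnected reasoning.
from itertools import combinations

def exact_match(pred, gold):
    return pred.strip().lower() == gold.strip().lower()

def shows_disconnected_reasoning(model, question, supporting_facts, distractors, gold_answer):
    """True if some 2-way split of the supporting facts lets the model
    answer correctly from each part alone (model.predict is a hypothetical API)."""
    facts = list(supporting_facts)
    for k in range(1, len(facts)):
        for part1 in combinations(facts, k):
            part2 = [f for f in facts if f not in part1]
            ans1 = model.predict(question, list(part1) + list(distractors))
            ans2 = model.predict(question, part2 + list(distractors))
            if exact_match(ans1, gold_answer) and exact_match(ans2, gold_answer):
                return True
    return False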
Robustifying Multi-hop QA through Pseudo-Evidentiality Training
TLDR
This paper proposes a new approach to learn evidentiality, deciding whether the answer prediction is supported by correct evidence, without such annotations, and compares counterfactual changes in answer confidence with and without evidence sentences to generate “pseudo-evidentiality” annotations.
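A minimal sketch of the counterfactual idea summarized above, assuming a hypothetical answer_confidence(question, context_sentences, answer) scorer: a sentence whose removal causes a large drop in answer confidence is labeled as (pseudo) evidence. This is a simplified per-sentence variant for illustration, not the paper's training procedure.

def pseudo_evidentiality_labels(answer_confidence, question, sentences, answer, min_drop=0.2):
    # Compare confidence with and without each candidate sentence; a large
    # counterfactual drop marks the sentence as pseudo-evidence.
    full_conf = answer_confidence(question, sentences, answer)
    labels = []
    for i in range(len(sentences)):
        ablated = sentences[:i] + sentences[i + 1:]
        labels.append(full_conf - answer_confidence(question, ablated, answer) >= min_drop)
    return labels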
TextGraphs 2020 Shared Task on Multi-Hop Inference for Explanation Regeneration
TLDR
In this second iteration of the explanation regeneration shared task, participants are supplied with more than double the training and evaluation data, as well as a knowledge base nearly double in size, both of which expand into more challenging scientific topics that increase the difficulty of the task.
A Survey on Multi-hop Question Answering and Generation
TLDR
A general and formal definition of the MHQA task is provided, the existing attempts at this highly interesting yet quite challenging task are summarized, and the best methods to create MHQA datasets are outlined.
Hey AI, Can You Solve Complex Tasks by Talking to Agents?
TLDR
A synthetic benchmark, CommaQA, is designed with three complex reasoning tasks that are meant to be solved by communicating with existing QA agents, showing that black-box models struggle to learn this task from scratch even with access to each agent’s knowledge and gold facts supervision.
Reasoning over Public and Private Data in Retrieval-Based Systems
TLDR
This work defines the PUBLIC-PRIVATE AUTOREGRESSIVE INFORMATION RETRIEVAL (PAIR) privacy framework for the novel retrieval setting over multiple privacy scopes and argues that an adequate benchmark is missing to study PAIR since existing textual benchmarks require retrieving from a single data distribution.
♫ MuSiQue: Multihop Questions via Single-hop Question Composition
TLDR
A bottom–up approach is introduced that systematically selects composable pairs of single-hop questions that are connected, that is, where one reasoning step critically relies on information from another, to create MuSiQue-Ans, a new multihop QA dataset with 25K 2–4 hop questions.
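To make the composition criterion above concrete, here is a toy sketch (the field names and string-matching heuristic are assumptions for illustration, not MuSiQue's actual pipeline, which relies on additional filtering and validation): two single-hop questions are treated as composable only when the first question's answer appears in the second question, so the second hop critically depends on the first.

def is_connected_pair(q1, q2):
    # q1, q2: dicts with "question" and "answer" strings (assumed format).
    return q1["answer"] in q2["question"]

def compose(q1, q2):
    # Replace the mention of q1's answer in q2 with a reference to q1,
    # yielding a 2-hop question whose final answer is q2's answer.
    merged = q2["question"].replace(q1["answer"], 'the answer of "' + q1["question"] + '"')
    return {"question": merged, "answer": q2["answer"]}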
An Empirical Survey of Data Augmentation for Limited Data Learning in NLP
TLDR
An empirical survey of recent progress on data augmentation for NLP in the limited labeled data setting is provided, summarizing the landscape of methods and carrying out experiments on 11 datasets covering topics/news classification, inference tasks, paraphrasing tasks, and single-sentence tasks.
Evaluating Explanations for Reading Comprehension with Realistic Counterfactuals
TLDR
This analysis suggests that pairwise explanation techniques are better suited to RC than token-level attributions, which are often unfaithful in the scenarios the authors consider, and proposes an improvement to an attention-based attribution technique, resulting in explanations which better reveal the model’s behavior.
Learning to Solve Complex Tasks by Talking to Agents
TLDR
This work proposes a new benchmark called COMMAQA that contains three kinds of complex reasoning tasks that are designed to be solved by “talking” to four agents with different capabilities and hopes it serves as a novel benchmark to enable the development of “green” AI systems that build upon existing agents.
Reasoning Chain Based Adversarial Attack for Multi-hop Question Answering
TLDR
A multi-hop reasoning chain-based adversarial attack method that makes it possible to align the question to each reasoning hop and thus attack any hop, and, via adversarial training on the generated examples, improves the performance and robustness of these models.

References

Showing 1-10 of 31 references
XLNet: Generalized Autoregressive Pretraining for Language Understanding
TLDR
XLNet is proposed, a generalized autoregressive pretraining method that enables learning bidirectional contexts by maximizing the expected likelihood over all permutations of the factorization order and overcomes the limitations of BERT thanks to its autoregressive formulation.
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
TLDR
It is shown that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.
Select, Answer and Explain: Interpretable Multi-hop Reading Comprehension over Multiple Documents
TLDR
This paper proposes an effective and interpretable Select, Answer and Explain (SAE) system to solve the multi-document RC problem and achieves top competitive performance in distractor setting compared to other existing systems on the leaderboard.
Compositional Questions Do Not Necessitate Multi-hop Reasoning
TLDR
This work introduces a single-hop BERT-based RC model that achieves 67 F1, comparable to state-of-the-art multi-hop models, and designs an evaluation setting where humans are not shown all of the necessary paragraphs for the intended multi-hop reasoning but can still answer over 80% of questions.
A Simple Yet Strong Pipeline for HotpotQA
TLDR
This paper presents a simple pipeline based on BERT that outperforms large-scale language models on both question answering and support identification on HotpotQA (and achieves performance very close to a RoBERTa model).
Assessing the Benchmarking Capacity of Machine Reading Comprehension Datasets
TLDR
Results suggest that most of the questions already answered correctly by the model do not necessarily require grammatical and complex reasoning, and therefore, MRC datasets will need to take extra care in their design to ensure that questions can correctly evaluate the intended skills.
Evaluating NLP Models via Contrast Sets
TLDR
A new annotation paradigm for NLP is proposed that helps to close systematic gaps in the test data, and it is recommended that after a dataset is constructed, the dataset authors manually perturb the test instances in small but meaningful ways that change the gold label, creating contrast sets.
Hierarchical Graph Network for Multi-hop Question Answering
TLDR
Experiments on the HotpotQA benchmark demonstrate that the proposed model achieves new state of the art in both the Distractor and Fullwiki settings.
Learning the Difference that Makes a Difference with Counterfactually-Augmented Data
TLDR
This paper focuses on natural language processing, introducing methods and resources for training models less sensitive to spurious patterns, and tasks humans with revising each document so that it accords with a counterfactual target label and retains internal coherence.
Learning to Retrieve Reasoning Paths over Wikipedia Graph for Question Answering
TLDR
A new graph-based recurrent retrieval approach that learns to retrieve reasoning paths over the Wikipedia graph to answer multi-hop open-domain questions and achieves significant improvement in HotpotQA, outperforming the previous best model by more than 14 points.