Corpus ID: 236493173

Break, Perturb, Build: Automatic Perturbation of Reasoning Paths through Question Decomposition

Mor Geva, Tomer Wolfson, Jonathan Berant
Recent efforts to create challenge benchmarks that test the abilities of natural language understanding models have largely depended on human annotations. In this work, we introduce the “Break, Perturb, Build” (BPB) framework for automatic reasoning-oriented perturbation of question-answer pairs. BPB represents a question by decomposing it into the reasoning steps that are required to answer it, symbolically perturbs the decomposition, and then generates new question-answer pairs. We demonstrate…
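The decompose-perturb-rebuild pipeline described above can be sketched in a few lines. This is a toy illustration only: the step format, the single perturbation rule (flipping a comparison operator), and the hard-coded surface forms are all hypothetical simplifications, not the paper's actual implementation.

```python
# Toy sketch of a BPB-style pipeline: decompose -> perturb -> rebuild.
# The step tuples and the max->min flip are illustrative assumptions.

def decompose(question):
    """Toy decomposition of a comparison question into reasoning steps."""
    # e.g. "Which is longer, the Nile or the Amazon?"
    return [
        ("select", "length of the Nile"),
        ("select", "length of the Amazon"),
        ("compare", "max", "#1", "#2"),
    ]

def perturb(steps):
    """Symbolically flip the comparison operator (max -> min)."""
    out = []
    for step in steps:
        if step[0] == "compare" and step[1] == "max":
            out.append(("compare", "min", step[2], step[3]))
        else:
            out.append(step)
    return out

def rebuild(steps):
    """Generate a new question from the perturbed steps (toy surface form)."""
    if steps[-1][:2] == ("compare", "min"):
        return "Which is shorter, the Nile or the Amazon?"
    return "Which is longer, the Nile or the Amazon?"

steps = decompose("Which is longer, the Nile or the Amazon?")
new_question = rebuild(perturb(steps))
```

Because the perturbation operates on the symbolic decomposition rather than the question string, the new answer can be computed deterministically alongside the new question.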
Retrieval-guided Counterfactual Generation for QA
This work develops a Retrieve-Generate-Filter technique to create counterfactual evaluation and training data with minimal human supervision, and finds that RGF data leads to significant improvements in a model’s robustness to local perturbations.


Break It Down: A Question Understanding Benchmark
This work introduces a Question Decomposition Meaning Representation (QDMR) for questions, and demonstrates the utility of QDMR by showing that it can be used to improve open-domain question answering on the HotpotQA dataset, and can be deterministically converted to a pseudo-SQL formal language, which can alleviate annotation in semantic parsing applications.
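A QDMR-style decomposition chains steps that reference earlier steps via "#N" placeholders. The sketch below shows this chaining on a made-up question; the step format is simplified for illustration and is not the benchmark's exact annotation scheme.

```python
# Hypothetical QDMR-style decomposition: each step may reference earlier
# steps with "#N". We resolve the references to recover a single phrase.
import re

qdmr = [
    "return touchdowns of the Jets",
    "return #1 in the first half",
    "return number of #2",
]

def inline_references(steps):
    """Substitute each "#N" with the already-resolved text of step N."""
    resolved = []
    for step in steps:
        text = step.removeprefix("return ")
        text = re.sub(r"#(\d+)", lambda m: resolved[int(m.group(1)) - 1], text)
        resolved.append(text)
    return resolved[-1]

phrase = inline_references(qdmr)
# phrase == "number of touchdowns of the Jets in the first half"
```

The same "#N" structure is what makes the deterministic conversion to a pseudo-SQL formal language possible: each step corresponds to a relational operation over the results of the steps it references.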
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
A new reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs, and presents a new model that combines reading comprehension methods with simple numerical reasoning to achieve 51% F1.
PathQG: Neural Question Generation from Facts
This paper presents a novel task of question generation given a query path in the knowledge graph constructed from the input text; it formulates query representation learning as a sequence-labeling problem for identifying the facts that form a query, and employs an RNN-based generator for question generation.
Automatic Construction of Evaluation Suites for Natural Language Generation Datasets
A framework is developed that generates controlled perturbations and identifies subsets in text-to-scalar, text-to-text, or data-to-text settings; it is applied to the GEM generation benchmark.
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
It is found that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingly, continues to be very beneficial even when starting from massive pre-trained language models such as BERT.
Tailor: Generating and Perturbing Text with Semantic Controls
Tailor, a task-agnostic generation system that perturbs text in a semantically controlled way, is introduced; its perturbations effectively improve compositionality in fine-grained style transfer, outperforming fine-tuned baselines on 6 transfers.
Adversarial Examples for Evaluating Reading Comprehension Systems
This work proposes an adversarial evaluation scheme for the Stanford Question Answering Dataset that tests whether systems can answer questions about paragraphs that contain adversarially inserted sentences without changing the correct answer or misleading humans.
ASQ: Automatically Generating Question-Answer Pairs using AMRs
This work introduces ASQ, a tool that automatically mines question-answer pairs from a sentence using its Abstract Meaning Representation (AMR), making the process faster and more cost-effective without compromising the quality and validity of the resulting pairs.
Automatic Generation of Contrast Sets from Scene Graphs: Probing the Compositional Consistency of GQA
This work presents a novel method which leverages rich semantic input representation to automatically generate contrast sets for the visual question answering task and computes the answer of perturbed questions, thus vastly reducing annotation cost and enabling thorough evaluation of models’ performance on various semantic aspects.
SQuAD: 100,000+ Questions for Machine Comprehension of Text
A strong logistic regression model is built, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%).