Self-Teaching Machines to Read and Comprehend with Large-Scale Multi-Subject Question Answering Data

Dian Yu, Kai Sun, Dong Yu, Claire Cardie
In spite of much recent research in the area, it is still unclear whether subject-area question-answering data is useful for machine reading comprehension (MRC) tasks. In this paper, we investigate this question. We collect a large-scale multi-subject multiple-choice question-answering dataset, ExamQA, and use incomplete and noisy snippets returned by a web search engine as the relevant context for each question-answering instance to convert it into a weakly-labeled MRC instance. We then propose…
1 Citation

Bidirectional Attention Flow Using Answer Pointer and QANet
This work chooses the SQuAD default project, improves the baseline BiDAF model with character-level embeddings, implements and improves the QANet model, implements the Answer Pointer model, and explores both a lightweight model and a larger model with more parameters.


Learning to Ask Unanswerable Questions for Machine Reading Comprehension
A pair-to-sequence model for unanswerable question generation, which effectively captures the interactions between the question and the paragraph, and a way to construct training data for question generation models by leveraging the existing reading comprehension dataset are presented.
Improving Question Answering with External Knowledge
This work explores simple yet effective methods for exploiting two sources of unstructured external knowledge for subject-area QA, focusing on multiple-choice question answering tasks in subject areas such as science.
Investigating Prior Knowledge for Challenging Chinese Machine Reading Comprehension
This paper presents the first free-form multiple-Choice Chinese machine reading Comprehension dataset (C3), containing 13,369 documents and their associated 19,577 multiple-choice free-form questions collected from Chinese-as-a-second-language examinations, and presents a comprehensive analysis of the prior knowledge needed for these real-world problems.
MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text
MCTest is presented, a freely available set of stories and associated questions intended for research on the machine comprehension of text that requires machines to answer multiple-choice reading comprehension questions about fictional stories, directly tackling the high-level goal of open-domain machine comprehension.
Improving Machine Reading Comprehension with General Reading Strategies
Three general strategies aimed at improving non-extractive machine reading comprehension (MRC) are proposed, and the effectiveness of these strategies and the versatility and general applicability of fine-tuned models that incorporate them are demonstrated.
Unsupervised Adaptation of Question Answering Systems via Generative Self-training
This paper investigates the iterative generation of synthetic QA pairs as a way to realize unsupervised self adaptation, and presents iterative generalizations of the approach, which maximize an approximation of a lower bound on the probability of the adaptation data.
IJCNLP-2017 Task 5: Multi-choice Question Answering in Examinations
The collected data, the format and size of these questions, formal run statistics and results, overview and performance statistics of different methods are described.
RACE: Large-scale ReAding Comprehension Dataset From Examinations
The proportion of questions that require reasoning is much larger in RACE than in other benchmark datasets for reading comprehension, and there is a significant gap between the performance of the state-of-the-art models and the ceiling human performance.
Logic-Guided Data Augmentation and Regularization for Consistent Question Answering
This paper addresses the problem of improving the accuracy and consistency of responses to comparison questions by integrating logic rules and neural models: it leverages logical and linguistic knowledge to augment labeled training data and then uses a consistency-based regularizer to train the model.
Supervised and Unsupervised Transfer Learning for Question Answering
The performance of both models on a TOEFL listening comprehension test and MCTest is significantly improved via a simple transfer learning technique from MovieQA, which achieves the state-of-the-art on all target datasets.