Training Question Answering Models from Synthetic Data

Authors: Raul Puri, Ryan Spring, Mostofa Patwary, Mohammad Shoeybi, Bryan Catanzaro
Question and answer generation is a data augmentation method that aims to improve question answering (QA) models given the limited amount of human labeled data. However, a considerable gap remains between synthetic and human-generated question-answer pairs. This work aims to narrow this gap by taking advantage of large language models and explores several factors such as model size, quality of pretrained models, scale of data synthesized, and algorithmic choices. On the SQuAD1.1 question… 

Synthetic Data Generation for Multilingual Domain-Adaptable Question Answering Systems

This work introduces a pipeline that creates synthetic data from natural text, demonstrating the domain adaptability and multilingual potential of the approach and producing synthetic data in English and Dutch.

Qasar: Self-Supervised Learning Framework for Extractive Question Answering

This paper introduces a novel QA framework, Qasar, which uses self-supervised learning for efficient domain adaptation, and shows, for the first time, the advantage of fine-tuning pre-trained QA models for closed domains on domain-specific questions and answers synthetically generated by large language models like T5.

Intermediate Training on Question Answering Datasets Improves Generative Data Augmentation

This work improves generative data augmentation by formulating data generation as a context generation task and using question answering (QA) datasets for intermediate training; the GLM then generates relevant contexts, which are used as synthetic training data for the corresponding tasks.

Synthetic Question Value Estimation for Domain Adaptation of Question Answering

This paper introduces a novel idea of training a question value estimator (QVE) that directly estimates the usefulness of synthetic questions for improving the target-domain QA performance and shows that the synthetic questions selected by QVE can help achieve better target-domain QA performance, in comparison with existing techniques.

Contrastive Domain Adaptation for Question Answering using Limited Text Corpora

This paper proposes a novel framework for domain adaptation called contrastive domain adaptation for QA (CAQA), which combines techniques from question generation and domain-invariant learning to answer out-of-domain questions in settings with limited text corpora.

Cooperative Self-training of Machine Reading Comprehension

This work presents a cooperative self-training framework, RGX, for automatically generating more non-trivial question-answer pairs to improve model performance, and shows that RGX outperforms the state-of-the-art (SOTA) pretrained language models and transfer learning approaches on standard question-answering benchmarks, yielding new SOTA performance under the given model size and transfer learning settings.

Improving Unsupervised Question Answering via Summarization-Informed Question Generation

This work presents a distantly-supervised QG method that uses questions generated heuristically from summaries as a source of training data for a QG system, and that substantially outperforms previous unsupervised models on three in-domain datasets and three out-of-domain datasets.

Topic Transferable Table Question Answering

This work proposes T3QA (Topic Transferable Table Question Answering), a pragmatic adaptation framework for TableQA comprising a topic-specific vocabulary injection into BERT, a novel text-to-text transformer generator, a natural language question generation pipeline focused on generating topic-specific training data, and a logical form re-ranker.

Relation-Guided Pre-Training for Open-Domain Question Answering

It is demonstrated that by pre-training with the proposed RGPT-QA technique, the popular open-domain QA model, Dense Passage Retriever, achieves 2.2%, 2.4%, and 6.3% absolute improvement in Exact Match accuracy on Natural Questions, TriviaQA, and WebQuestions.

Few-Shot Question Answering by Pretraining Span Selection

This work proposes a new pretraining scheme tailored for question answering: recurring span selection, where masked spans are replaced with a special token that is later used during fine-tuning to select the answer span.
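
As a rough illustration of the masking step described above, the following minimal sketch keeps the first occurrence of a recurring token and replaces later occurrences with a special token. The token name `[QUESTION]`, the function `mask_recurring_span`, and the restriction to single-token spans are assumptions made for this example, not details taken from the summary.

```python
from typing import List, Tuple

MASK = "[QUESTION]"  # assumed name for the special token


def mask_recurring_span(tokens: List[str], span: str) -> Tuple[List[str], List[int]]:
    """Keep the first occurrence of `span`; replace later ones with MASK.

    Returns the masked token list and the positions that were masked.
    Simplification: a span is a single token here.
    """
    masked: List[str] = []
    positions: List[int] = []
    seen = False
    for i, tok in enumerate(tokens):
        if tok == span and seen:
            masked.append(MASK)
            positions.append(i)
        else:
            seen = seen or tok == span
            masked.append(tok)
    return masked, positions


tokens = "Paris is large . Tourists love Paris".split()
masked_tokens, masked_positions = mask_recurring_span(tokens, "Paris")
```

In this toy setup, a model would be trained to point from each masked position back to the surviving occurrence of the span.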



Simple and Effective Semi-Supervised Question Answering

This work envisions a system where the end user specifies a set of base documents and only a few labelled examples, and exploits the document structure to create cloze-style questions from these base documents; pre-trains a powerful neural network on the cloze style questions; and further fine-tunes the model on the labeled examples.
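
A cloze-style question of the kind mentioned above can be sketched by blanking out a chosen answer span in a sentence. This is a hedged illustration only: the placeholder `[BLANK]` and the helper `make_cloze` are hypothetical, and the way the answer span is chosen from document structure is not shown.

```python
from typing import Optional

BLANK = "[BLANK]"  # assumed placeholder string


def make_cloze(sentence: str, answer: str) -> Optional[str]:
    """Replace the first occurrence of `answer` in `sentence` with BLANK.

    Returns None when the answer does not appear in the sentence.
    """
    if not answer or answer not in sentence:
        return None
    return sentence.replace(answer, BLANK, 1)


cloze = make_cloze("The Eiffel Tower is in Paris.", "Paris")
```

The resulting (cloze question, answer) pairs would serve as pre-training data before fine-tuning on the few labeled examples.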

Learning to Answer by Learning to Ask: Getting the Best of GPT-2 and BERT Worlds

The analysis shows that the proposed generation & answering collaboration framework relatively improves both tasks and is particularly powerful in the semi-supervised setup and the results suggest a robust and comparably lean pipeline facilitating question generation in the small-data regime.

Unsupervised Question Answering by Cloze Translation

It is found that modern QA models can learn to answer human questions surprisingly well using only synthetic training data, and it is demonstrated that, without using the SQuAD training data at all, this approach achieves 56.4 F1 on SQuAD v1.

Language Models are Unsupervised Multitask Learners

It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

It is found that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingly, continues to be very beneficial even when starting from massive pre-trained language models such as BERT.

Learning to Ask Unanswerable Questions for Machine Reading Comprehension

A pair-to-sequence model for unanswerable question generation, which effectively captures the interactions between the question and the paragraph, and a way to construct training data for question generation models by leveraging the existing reading comprehension dataset is presented.

Synthetic QA Corpora Generation with Roundtrip Consistency

A novel method of generating synthetic question answering corpora is introduced by combining models of question generation and answer extraction, and by filtering the results to ensure roundtrip consistency, establishing a new state-of-the-art on SQuAD2 and NQ.
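
The roundtrip consistency filter described above can be sketched as follows: a synthetic (context, question, answer) triple is kept only if a QA model, given the context and the generated question, recovers the original answer. The names `roundtrip_filter`, `normalize`, and `toy_answer_fn` are assumptions for this example, and the toy answerer is a stand-in for a real QA model, not the method used in the paper.

```python
from typing import Callable, Iterable, List, Tuple

QATriple = Tuple[str, str, str]  # (context, question, answer)


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace before comparing answers."""
    return " ".join(text.lower().split())


def roundtrip_filter(
    candidates: Iterable[QATriple],
    answer_fn: Callable[[str, str], str],
) -> List[QATriple]:
    """Keep triples whose answer is reproduced by `answer_fn`."""
    kept: List[QATriple] = []
    for context, question, answer in candidates:
        predicted = answer_fn(context, question)
        if normalize(predicted) == normalize(answer):
            kept.append((context, question, answer))
    return kept


def toy_answer_fn(context: str, question: str) -> str:
    """Hypothetical stand-in for a QA model: returns the last word of the context."""
    return context.rstrip(".").split()[-1]


context = "The capital of France is Paris."
candidates = [
    (context, "What is the capital of France?", "Paris"),
    (context, "Which country is mentioned?", "France"),
]
consistent = roundtrip_filter(candidates, toy_answer_fn)
```

Exact match after normalization is one possible consistency criterion; a looser criterion such as token overlap would keep more pairs at the cost of more noise.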

SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine

It is shown that there is a meaningful gap between the human and machine performances, which suggests that the proposed dataset could well serve as a benchmark for question-answering.

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

It is shown that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.

Good Question! Statistical Ranking for Question Generation

This work uses manually written rules to perform a sequence of general purpose syntactic transformations to turn declarative sentences into questions, which are ranked by a logistic regression model trained on a small, tailored dataset consisting of labeled output from the system.