UnifiedQA: Crossing Format Boundaries With a Single QA System

@inproceedings{Khashabi2020UnifiedQACF,
  title={UnifiedQA: Crossing Format Boundaries With a Single QA System},
  author={Daniel Khashabi and Sewon Min and Tushar Khot and Ashish Sabharwal and Oyvind Tafjord and Peter Clark and Hannaneh Hajishirzi},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2020},
  year={2020}
}
Question answering (QA) tasks have been posed using a variety of formats, such as extractive span selection, multiple choice, etc. This has led to format-specialized models, and even to an implicit division in the QA community. We argue that such boundaries are artificial and perhaps unnecessary, given the reasoning abilities we seek to teach are not governed by the format. As evidence, we use the latest advances in language modeling to build a single pre-trained QA model, UNIFIEDQA, that… 
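The core idea of the abstract above is that any QA format can be serialized into one plain-text input for a single text-to-text model. A minimal sketch of such an encoder is below; the separators and option labels are illustrative assumptions in the spirit of UnifiedQA's encoding, not the paper's verbatim format.

```python
import string

# Sketch of cross-format QA input encoding (illustrative, not the paper's
# exact scheme): every format becomes one flat input string.
def encode_qa(question, context=None, choices=None):
    """Serialize extractive, abstractive, multiple-choice, or yes/no
    questions into a single plain-text input string."""
    parts = [question.strip()]
    if choices:  # multiple-choice: enumerate options as (A), (B), ...
        labels = string.ascii_uppercase
        parts.append(" ".join(f"({labels[i]}) {c}" for i, c in enumerate(choices)))
    if context:  # extractive/abstractive: append the passage
        parts.append(context.strip())
    return " \\n ".join(parts)
```

For example, `encode_qa("What color?", choices=["red", "blue"])` and `encode_qa("Is the sky blue?", context="The sky is blue.")` both yield flat strings, so one seq2seq model can be trained on all formats at once.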

Towards Question Format Independent Numerical Reasoning: A Set of Prerequisite Tasks
TLDR
This work introduces NUMBERGAME, a multifaceted benchmark to evaluate model performance across numerical reasoning tasks of eight diverse formats, and takes forward the recent progress in generic system development, demonstrating the scope of under-explored tasks.
Foreshadowing the Benefits of Incidental Supervision
TLDR
A unified PAC-Bayesian Informativeness measure (PABI) is proposed, characterizing the reduction in uncertainty that incidental supervision signals provide and demonstrating PABI's use in quantifying various types of incidental signals.
Teaching Pre-Trained Models to Systematically Reason Over Implicit Knowledge
TLDR
This work provides a first demonstration that LMs can be trained to reliably perform systematic reasoning combining both implicit, pre-trained knowledge and explicit natural language statements, and demonstrates that models learn to effectively perform inference which involves implicit taxonomic and world knowledge, chaining and counting.
ISAAQ - Mastering Textbook Questions with Pre-trained Transformers and Bottom-Up and Top-Down Attention
TLDR
This paper taps on the potential of transformer language models and bottom-up and top-down attention to tackle the language and visual understanding challenges this task entails, and relies on pre-trained transformers, fine-tuning and ensembling.
Language Models are Few-Shot Learners
TLDR
GPT-3 achieves strong performance on many NLP datasets, including translation, question-answering, and cloze tasks, as well as several tasks that require on-the-fly reasoning or domain adaptation, such as unscrambling words, using a novel word in a sentence, or performing 3-digit arithmetic.
Measuring Massive Multitask Language Understanding
TLDR
While most recent models have near random-chance accuracy, the very largest GPT-3 model improves over random chance by almost 20 percentage points on average; however, on every one of the 57 tasks, the best models still need substantial improvements before they can reach expert-level accuracy.
Text-to-Text Pre-Training for Data-to-Text Tasks
TLDR
It is indicated that text-to-text pre-training in the form of T5 enables simple, end-to-end transformer based models to outperform pipelined neural architectures tailored for data-to-text generation, as well as alternatives such as BERT and GPT-2.
BBQ: A hand-built bias benchmark for question answering
TLDR
The Bias Benchmark for QA (BBQ), a dataset consisting of question-sets constructed by the authors that highlight attested social biases against people belonging to protected classes along nine different social dimensions relevant for U.S. English-speaking contexts, is introduced.
BiRdQA: A Bilingual Dataset for Question Answering on Tricky Riddles
TLDR
BiRdQA, a bilingual multiple-choice question answering dataset with 6614 English riddles and 8751 Chinese riddles, is introduced, indicating that there is a long way to go before machines can beat humans at solving tricky riddles.
Towards General Natural Language Understanding with Probabilistic Worldbuilding
We introduce the Probabilistic Worldbuilding Model (PWM), a new fully symbolic Bayesian model of semantic parsing and reasoning, as a first step in a research program toward more domain- and
...

References

SHOWING 1-10 OF 56 REFERENCES
Language Models are Unsupervised Multitask Learners
TLDR
It is demonstrated that language models begin to learn these tasks without any explicit supervision when trained on a new dataset of millions of webpages called WebText, suggesting a promising path towards building language processing systems which learn to perform tasks from their naturally occurring demonstrations.
QASC: A Dataset for Question Answering via Sentence Composition
TLDR
This work presents a multi-hop reasoning dataset, Question Answering via Sentence Composition (QASC), that requires retrieving facts from a large corpus and composing them to answer a multiple-choice question, and provides annotation for supporting facts as well as their composition.
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
TLDR
A new reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs, and presents a new model that combines reading comprehension methods with simple numerical reasoning to achieve 51% F1.
Evaluating NLP Models via Contrast Sets
TLDR
A new annotation paradigm for NLP is proposed that helps to close systematic gaps in the test data, and it is recommended that after a dataset is constructed, the dataset authors manually perturb the test instances in small but meaningful ways that change the gold label, creating contrast sets.
A Discrete Hard EM Approach for Weakly Supervised Question Answering
TLDR
This paper develops a hard EM learning scheme that computes gradients relative to the most likely solution at each update and significantly outperforms previous methods on six QA tasks, including absolute gains of 2–10%, and achieves the state-of-the-art on five of them.
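The hard-EM scheme summarized above can be sketched as follows: among the candidate solutions licensed by weak supervision, train only on the one the current model scores highest, rather than marginalizing over all of them. This is a minimal sketch with toy log-probabilities; the function name and numbers are illustrative, not from the paper.

```python
import math

def hard_em_loss(candidate_logprobs):
    """Hard EM: instead of summing probability over all candidate
    solutions, take the negative log-likelihood of the single most
    likely candidate under the current model; gradients then flow
    only through that one solution."""
    best = max(candidate_logprobs)  # most likely candidate's log-prob
    return -best                    # NLL of that candidate only

# Toy example: three candidate answer spans scored by the model.
loss = hard_em_loss([math.log(0.1), math.log(0.7), math.log(0.2)])
```

At each update the selected candidate can change as the model improves, which is what distinguishes this from fixing a single gold solution up front.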
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer
TLDR
This systematic study compares pre-training objectives, architectures, unlabeled datasets, transfer approaches, and other factors on dozens of language understanding tasks and achieves state-of-the-art results on many benchmarks covering summarization, question answering, text classification, and more.
Comprehensive Multi-Dataset Evaluation of Reading Comprehension
TLDR
An evaluation server, ORB, is presented, that reports performance on seven diverse reading comprehension datasets, encouraging and facilitating testing a single model’s capability in understanding a wide variety of reading phenomena.
BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions
TLDR
It is found that transferring from entailment data is more effective than transferring from paraphrase or extractive QA data, and that it, surprisingly, continues to be very beneficial even when starting from massive pre-trained language models such as BERT.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
TLDR
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge
TLDR
This work presents CommonsenseQA: a challenging new dataset for commonsense question answering, which extracts from ConceptNet multiple target concepts that have the same semantic relation to a single source concept.
...