MultiQA: An Empirical Investigation of Generalization and Transfer in Reading Comprehension

@inproceedings{Talmor2019MultiQAAE,
  title={MultiQA: An Empirical Investigation of Generalization and Transfer in Reading Comprehension},
  author={Alon Talmor and Jonathan Berant},
  booktitle={ACL},
  year={2019}
}
A large number of reading comprehension (RC) datasets have been created recently, but little analysis has been done on whether they generalize to one another, and the extent to which existing datasets can be leveraged for improving performance on new ones. In this paper, we conduct such an investigation over ten RC datasets, training on one or more source RC datasets, and evaluating generalization, as well as transfer to a target RC dataset. We analyze the factors that contribute to…
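The transfer recipe the abstract describes is sequential fine-tuning: fine-tune a span-extraction model on one or more source RC datasets, then continue fine-tuning on the target dataset. Below is a minimal sketch of that setup, assuming a BERT QA model from the HuggingFace transformers library; the train() helper and the source/target loaders are illustrative placeholders, not the authors' code.

import torch
from transformers import AutoModelForQuestionAnswering

# Shared encoder with a span-extraction QA head.
model = AutoModelForQuestionAnswering.from_pretrained("bert-base-uncased")

source_loader = []  # placeholder: batches from source RC datasets (e.g., SQuAD + TriviaQA)
target_loader = []  # placeholder: batches from the new, possibly small, target RC dataset

def train(model, loader, epochs=2, lr=3e-5):
    # Standard span-extraction fine-tuning loop (schematic).
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for batch in loader:  # input_ids, attention_mask, start/end positions
            loss = model(**batch).loss  # QA head returns a loss when gold positions are given
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

# Stage 1: train on one or more source RC datasets.
train(model, source_loader)
# Stage 2: transfer by continuing fine-tuning on the target RC dataset.
train(model, target_loader)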
Citations

Single-dataset Experts for Multi-dataset Question Answering
TLDR
This work trains a collection of lightweight, dataset-specific adapter modules that share an underlying Transformer model, and finds that these Multi-Adapter Dataset Experts (MADE) outperform all of the authors' baselines in in-distribution accuracy, while simple parameter-averaging methods lead to better zero-shot generalization and few-shot transfer performance.
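A rough sketch of the adapter pattern and the parameter averaging described above, in PyTorch; the bottleneck size, dataset names, and the frozen shared Transformer are assumptions for illustration, not MADE's actual configuration.

import torch
import torch.nn as nn

class Adapter(nn.Module):
    # Bottleneck adapter: down-project, nonlinearity, up-project, residual.
    def __init__(self, hidden_size=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))

# One adapter (expert) per dataset; the underlying Transformer is shared.
datasets = ["squad", "triviaqa", "hotpotqa"]
adapters = {name: Adapter() for name in datasets}

# Parameter averaging across the single-dataset experts, the simple method
# reported to help zero-shot generalization: average the state dicts.
avg_state = {
    key: torch.stack([adapters[n].state_dict()[key] for n in datasets]).mean(0)
    for key in adapters[datasets[0]].state_dict()
}
merged = Adapter()
merged.load_state_dict(avg_state)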
ORB: An Open Reading Benchmark for Comprehensive Evaluation of Machine Reading Comprehension
TLDR
An evaluation server, ORB, is presented that reports performance on seven diverse reading comprehension datasets, encouraging and facilitating the testing of a single model's capability in understanding a wide variety of reading phenomena.
DuReader_robust: A Chinese Dataset Towards Evaluating Robustness and Generalization of Machine Reading Comprehension in Real-World Applications
TLDR
A real-world Chinese dataset, DuReader_robust, is introduced, designed to evaluate MRC models from three aspects: over-sensitivity, over-stability, and generalization.
Generalizing Question Answering System with Pre-trained Language Model Fine-tuning
TLDR
A multi-task learning framework is proposed that learns a shared representation across different tasks, built on top of a large pre-trained language model and then fine-tuned on multiple RC datasets.
Ensemble Learning-Based Approach for Improving Generalization Capability of Machine Reading Comprehension Systems
TLDR
The experimental results show the effectiveness and robustness of the ensemble approach in improving the out-of-distribution accuracy of MRC systems, especially when the base models are similar in accuracy.
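One common way to realize such an ensemble for span-extraction MRC is to average the base models' span probabilities and decode from the averaged distributions. A hedged sketch of that idea (the paper's exact combination scheme may differ; models and batch are assumed to be HuggingFace-style QA objects):

import torch

def ensemble_span_prediction(models, batch):
    # Average per-position start/end probabilities across the base QA models.
    start_probs, end_probs = [], []
    with torch.no_grad():
        for model in models:
            out = model(**batch)
            start_probs.append(out.start_logits.softmax(-1))
            end_probs.append(out.end_logits.softmax(-1))
    start = torch.stack(start_probs).mean(0)
    end = torch.stack(end_probs).mean(0)
    # Greedy decoding from the averaged distributions (no span-validity checks).
    return start.argmax(-1), end.argmax(-1)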
Improving QA Generalization by Concurrent Modeling of Multiple Biases
TLDR
This paper investigates the impact of debiasing methods on generalization and proposes a general framework for improving performance on both in-domain and out-of-domain datasets by concurrently modeling multiple biases in the training data.
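Concurrent bias modeling is often implemented as a product of experts between the main model and several frozen, precomputed bias models, so the main model learns what the biases cannot explain. The following is a generic sketch of that idea, not necessarily the exact framework proposed in the paper.

import torch.nn.functional as F

def multi_bias_poe_loss(main_logits, bias_logits_list, labels):
    # Combine the main model with each bias model in log space
    # (a product of experts); bias models are detached, i.e., frozen.
    combined = F.log_softmax(main_logits, dim=-1)
    for bias_logits in bias_logits_list:
        combined = combined + F.log_softmax(bias_logits, dim=-1).detach()
    # Renormalize and train the main model on the combined distribution.
    return F.nll_loss(F.log_softmax(combined, dim=-1), labels)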
CLER: Cross-task Learning with Expert Representation to Generalize Reading and Understanding
TLDR
The proposed CLER, which stands for Cross-task Learning with Expert Representation for the generalization of reading and understanding, is composed of three key ideas: multi-task learning, mixture of experts, and ensemble.
IIRC: A Dataset of Incomplete Information Reading Comprehension Questions
TLDR
A dataset with more than 13K questions over paragraphs from English Wikipedia that provide only partial information to answer them, with the missing information occurring in one or more linked documents; a baseline model achieves 31.1% F1 on this task, while estimated human performance is 88.4%.
DAM-Net: Robust QA System with Data Augmentation and Multitask Learning
While a plenitude of models has shown on-par performance with humans on question answering (QA) given a context paragraph, several works have shown that they generalize poorly on datasets that are…

References

Showing 1–10 of 44 references
Supervised and Unsupervised Transfer Learning for Question Answering
TLDR
The performance of both models on a TOEFL listening comprehension test and MCTest is significantly improved via a simple transfer learning technique from MovieQA, which achieves state-of-the-art results on all target datasets.
Constructing Datasets for Multi-hop Reading Comprehension Across Documents
TLDR
A novel task to encourage the development of models for text understanding across multiple documents and to investigate the limits of existing methods, in which a model learns to seek and combine evidence, effectively performing multi-hop (alias multi-step) inference.
DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
TLDR
A new reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs, is introduced, along with a new model that combines reading comprehension methods with simple numerical reasoning to achieve 51% F1.
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
TLDR
This new dataset aims to overcome a number of well-known weaknesses of previous publicly available datasets for the same task of reading comprehension and question answering, and is the most comprehensive real-world dataset of its kind in both quantity and quality.
TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension
TLDR
It is shown that, in comparison to other recently introduced large-scale datasets, TriviaQA has relatively complex, compositional questions, has considerable syntactic and lexical variability between questions and corresponding answer-evidence sentences, and requires more cross-sentence reasoning to find answers.
Improving Machine Reading Comprehension with General Reading Strategies
TLDR
Three general strategies aimed at improving non-extractive machine reading comprehension (MRC) are proposed, and the effectiveness of these strategies and the versatility and general applicability of fine-tuned models that incorporate them are demonstrated.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
TLDR
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.
RACE: Large-scale ReAding Comprehension Dataset From Examinations
TLDR
The proportion of questions that require reasoning is much larger in RACE than in other benchmark datasets for reading comprehension, and there is a significant gap between the performance of state-of-the-art models and ceiling human performance.
Repartitioning of the ComplexWebQuestions Dataset
TLDR
It is shown that training an RC model directly on the training data of ComplexWebQuestions reveals leakage from the training set to the test set that allows models to obtain unreasonably high performance.
SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine
TLDR
It is shown that there is a meaningful gap between human and machine performance, which suggests that the proposed dataset could well serve as a benchmark for question answering.