Selective Question Answering under Domain Shift

@inproceedings{Kamath2020SelectiveQA,
  title={Selective Question Answering under Domain Shift},
  author={Amita Kamath and Robin Jia and Percy Liang},
  booktitle={ACL},
  year={2020}
}
To avoid giving wrong answers, question answering (QA) models need to know when to abstain from answering. Moreover, users often ask questions that diverge from the model’s training data, making errors more likely and thus abstention more critical. In this work, we propose the setting of selective question answering under domain shift, in which a QA model is tested on a mixture of in-domain and out-of-domain data, and must answer (i.e., not abstain on) as many questions as possible while… 
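
The setting boils down to a simple loop: answer only when the model's confidence clears a threshold, and report accuracy over the answered subset together with coverage. Below is a minimal sketch of that loop, assuming a hypothetical predict function that returns an (answer, confidence) pair; the function names and the threshold value are illustrative, not taken from the paper.

def selective_answer(predict, questions, threshold=0.8):
    """Answer only when confidence clears the threshold; otherwise abstain (return None)."""
    results = []
    for question in questions:
        answer, confidence = predict(question)  # hypothetical QA model interface
        results.append((question, answer if confidence >= threshold else None))
    return results

def coverage_and_accuracy(results, gold):
    """Coverage = fraction of questions answered; accuracy is computed over answered questions only."""
    answered = [(q, a) for q, a in results if a is not None]
    coverage = len(answered) / len(results)
    correct = sum(1 for q, a in answered if gold[q] == a)
    accuracy = correct / len(answered) if answered else 0.0
    return coverage, accuracy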

Citations

Know When To Abstain: Calibrating Question Answering System under Domain Shift

TLDR
This project focuses on confidence modeling of QA systems under domain shift, and proposes a systematic approach to calibrate the model by augmenting it with a calibrator trained on a small subset of out-of-domain examples.
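
A rough sketch of this kind of correctness calibrator is given below. The feature set (model softmax probability plus question and context lengths) and the choice of a random forest are assumptions for illustration; the paper's exact recipe may differ.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def featurize(example, softmax_prob):
    """Per-example features: model confidence plus simple input-length statistics (illustrative)."""
    return [softmax_prob,
            len(example["question"].split()),
            len(example["context"].split())]

def train_calibrator(examples, softmax_probs, is_correct):
    """Fit a classifier that predicts whether the QA model's answer on an example is correct."""
    X = np.array([featurize(ex, p) for ex, p in zip(examples, softmax_probs)])
    y = np.array(is_correct, dtype=int)
    return RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def calibrated_confidence(calibrator, example, softmax_prob):
    """Estimated probability that the model's answer is correct; abstain when it is low."""
    return calibrator.predict_proba([featurize(example, softmax_prob)])[0, 1]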

Robust Question Answering Through Sub-part Alignment

TLDR
This work models question answering as an alignment problem, decomposing both the question and the context into smaller units based on off-the-shelf semantic representations and aligning the question to a subgraph of the context in order to find the answer.

Domain Adaptation for Question Answering via Question Classification

TLDR
The proposed QC4QA is shown to deliver consistent improvements over state-of-the-art baselines on multiple datasets.

Can NLI Models Verify QA Systems' Predictions?

TLDR
Careful manual analysis of the NLI model's predictions shows that it can further identify cases where the QA model produces the right answer for the wrong reason, i.e., when the answer sentence does not address all aspects of the question.

How Can We Know When Language Models Know? On the Calibration of Language Models for Question Answering

TLDR
This paper examines three strong generative models -- T5, BART, and GPT-2 -- and methods to calibrate them so that their confidence scores correlate better with the likelihood of correctness, via fine-tuning, post-hoc probability modification, or adjustment of the predicted outputs or inputs.

Will this Question be Answered? Question Filtering via Answer Model Distillation for Efficient Question Answering

TLDR
An interesting new finding is made: the answer confidence scores of state-of-the-art QA systems can be approximated well by models that use only the input question text, which enables preemptive filtering of questions whose answer confidence scores would fall below the system threshold.
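
A hedged sketch of such question-only filtering: a lightweight regressor is fit to approximate the full system's answer confidence from the question text alone, and questions whose predicted confidence falls below the system threshold are filtered out before the expensive QA model runs. The TF-IDF features and ridge regressor here are stand-ins, not the paper's actual model.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

def train_question_only_scorer(questions, system_confidences):
    """Fit a question-only regressor that approximates the QA system's answer confidence."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=2)
    X = vectorizer.fit_transform(questions)
    scorer = Ridge(alpha=1.0).fit(X, system_confidences)
    return vectorizer, scorer

def filter_questions(vectorizer, scorer, questions, threshold):
    """Keep only the questions the scorer expects the system to answer above its threshold."""
    predicted = scorer.predict(vectorizer.transform(questions))
    return [q for q, p in zip(questions, predicted) if p >= threshold]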

Towards Improving Selective Prediction Ability of NLP Systems

TLDR
This work proposes a method that improves models' probability estimates by calibrating them using prediction confidence and per-instance difficulty scores, training a calibrator to predict the likelihood that the model's prediction is correct.

Challenges in Information-Seeking QA: Unanswerable Questions and Paragraph Retrieval

TLDR
This study manually annotates 800 unanswerable examples across six languages for what makes them challenging to answer and conducts per-category answerability prediction, revealing issues in current dataset collection as well as in task formulation.

Knowing More About Questions Can Help: Improving Calibration in Question Answering

TLDR
This work presents the first calibration study in the open retrieval setting, comparing the calibration accuracy of retrieval-based span prediction models and answer generation models, and shows robust gains in all settings.

It's better to say "I can't answer" than answering incorrectly: Towards Safety critical NLP systems

TLDR
This work proposes a methodology that incorporates the degree of correctness, shifting away from classification labels by directly predicting the probability that the model's prediction is correct, and outperforms existing approaches on Natural Language Inference datasets.
...

References

Showing 1-10 of 57 references

Evaluating Question Answering Evaluation

TLDR
This work studies the suitability of existing metrics for QA and explores applying BERTScore, a recently proposed metric for evaluating translation, to QA, finding that although it fails to provide stronger correlation with human judgements, future work on tailoring a BERT-based metric to QA evaluation may prove fruitful.

Know What You Don’t Know: Unanswerable Questions for SQuAD

TLDR
SQuadRUn is a new dataset that combines the existing Stanford Question Answering Dataset (SQuAD) with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

TLDR
It is shown that HotpotQA is challenging for the latest QA systems, and the supporting facts enable models to improve performance and make explainable predictions.

Natural Questions: A Benchmark for Question Answering Research

TLDR
The Natural Questions corpus, a question answering data set, is presented, introducing robust metrics for the purposes of evaluating question answering systems; demonstrating high human upper bounds on these metrics; and establishing baseline results using competitive methods drawn from related literature.

Quizbowl: The Case for Incremental Question Answering

TLDR
This work makes two key contributions to machine learning research through Quizbowl: collecting and curating a large factoid QA dataset and an accompanying gameplay dataset, and developing a computational approach to playing Quiz Bowl that involves determining both what to answer and when to answer.

WikiQA: A Challenge Dataset for Open-Domain Question Answering

TLDR
The WIKIQA dataset is described, a new publicly available set of question and sentence pairs, collected and annotated for research on open-domain question answering, which is more than an order of magnitude larger than the previous dataset.

SQuAD: 100,000+ Questions for Machine Comprehension of Text

TLDR
A strong logistic regression model is built, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%).
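
For context, SQuAD scores a prediction by token-level F1 against the gold answer span (alongside exact match). A simplified sketch is below; the official evaluation script additionally normalizes case, punctuation, and articles, which is omitted here.

from collections import Counter

def token_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between a predicted answer string and a gold answer string (simplified)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)  # per-token overlap counts
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)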

The process of question answering.

TLDR
This theory of question answering has been implemented in a computer program, QUALM, currently being used by two story understanding systems to complete a natural language processing system which reads stories and answers questions about what was read.

Is It the Right Answer? Exploiting Web Redundancy for Answer Validation

TLDR
This work presents a novel approach to answer validation based on the intuition that the amount of implicit knowledge which connects an answer to a question can be quantitatively estimated by exploiting the redundancy of Web information.

SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine

TLDR
It is shown that there is a meaningful gap between human and machine performance, which suggests that the proposed dataset could serve well as a benchmark for question answering.
...