Unsupervised Evaluation for Question Answering with Transformers

Lukas Muttenthaler, Isabelle Augenstein, Johannes Bjerva
It is challenging to automatically evaluate the answer of a QA model at inference time. Although many models provide confidence scores, and simple heuristics can go a long way towards indicating answer correctness, such measures are heavily dataset-dependent and are unlikely to generalise. In this work, we begin by investigating the hidden representations of questions, answers, and contexts in transformer-based QA architectures. We observe a consistent pattern in the answer representations…

SubjQA: A Dataset for Subjectivity and Review Comprehension
This work investigates the relationship between subjectivity and QA, while developing a new dataset containing subjectivity annotations for questions and answer spans across 6 distinct domains, and releases an English QA dataset (SubjQA) based on customer reviews.
Subjective Question Answering: Deciphering the inner workings of Transformers in the realm of subjectivity
The inner workings (i.e., latent representations) of a Transformer-based architecture are investigated to contribute to a better understanding of these still poorly understood "black-box" models.


Know What You Don’t Know: Unanswerable Questions for SQuAD
SQuADRUn is a new dataset that combines the existing Stanford Question Answering Dataset (SQuAD) with over 50,000 unanswerable questions written adversarially by crowdworkers to look similar to answerable ones.
SQuAD: 100,000+ Questions for Machine Comprehension of Text
A strong logistic regression model is built, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%).
How Does BERT Answer Questions?: A Layer-Wise Analysis of Transformer Representations
A layer-wise analysis of BERT's hidden states reveals that fine-tuning has little impact on the models' semantic abilities and that prediction errors can be recognized in the vector representations of even early layers.
GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding
A benchmark of nine diverse NLU tasks, an auxiliary dataset for probing models for understanding of specific linguistic phenomena, and an online platform for evaluating and comparing models, which favors models that can represent linguistic knowledge in a way that facilitates sample-efficient learning and effective knowledge-transfer across tasks.
Whatcha lookin' at? DeepLIFTing BERT's Attention in Question Answering
This paper investigates one such model, BERT for question answering, with the aim of analyzing why it achieves significantly better results than other models.
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.
An Assessment of the Accuracy of Automatic Evaluation in Summarization
An assessment of the automatic evaluations used for multi-document summarization of news, and recommendations about how any evaluation, manual or automatic, should be used to find statistically significant differences between summarization systems.
DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter
This work proposes a method to pre-train a smaller general-purpose language representation model, called DistilBERT, which can be fine-tuned with good performance on a wide range of tasks like its larger counterparts, and introduces a triple loss combining language modeling, distillation and cosine-distance losses.
Bleu: a Method for Automatic Evaluation of Machine Translation
This work proposes a method of automatic machine translation evaluation that is quick, inexpensive, and language-independent, that correlates highly with human evaluation, and that has little marginal cost per run.
Adam: A Method for Stochastic Optimization
This work introduces Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order moments, and provides a regret bound on the convergence rate that is comparable to the best known results under the online convex optimization framework.