MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics

Anthony Chen, Gabriel Stanovsky, Sameer Singh, Matt Gardner
Posing reading comprehension as a generation problem provides a great deal of flexibility, allowing for open-ended questions with few restrictions on possible answers. However, progress is impeded by existing generation metrics, which rely on token overlap and are agnostic to the nuances of reading comprehension. To address this, we introduce a benchmark for training and evaluating generative reading comprehension metrics: MOdeling Correctness with Human Annotations. MOCHA contains 40K human… 
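
To illustrate the limitation of token-overlap metrics noted above, here is a minimal sketch of a token-level F1 score of the kind commonly used in QA evaluation. The function name `token_f1`, the whitespace tokenization, and the example strings are assumptions made for illustration; they are not taken from MOCHA.

```python
from collections import Counter


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference answer.

    Assumption: lowercased whitespace tokenization; real evaluation
    scripts usually also strip punctuation and articles.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match, one empty as a miss.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical answers to "Why did he go to bed early?"
reference = "he was tired"
paraphrase = "exhaustion"      # plausibly correct, shares no tokens
wrong = "he was not tired"     # contradicts the reference, shares most tokens

print(token_f1(paraphrase, reference))
print(token_f1(wrong, reference))
```

Under these assumptions the contradicting answer scores higher than the paraphrase, which is the kind of mismatch with human judgment that motivates a learned metric.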

Challenges in Information-Seeking QA: Unanswerable Questions and Paragraph Retrieval

This study manually annotates 800 unanswerable examples across six languages on what makes them challenging to answer and conducts per-category answerability prediction, revealing issues in the current dataset collection as well as task formulation.

Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary

This work proposes a metric that evaluates the content quality of a summary using question-answering (QA), identifies its performance bottlenecks, and estimates that its potential upper-bound performance surpasses all other automatic metrics, approaching that of the gold-standard Pyramid Method.

GENIE: Toward Reproducible and Standardized Human Evaluation for Text Generation

GENIE is introduced: a system for running standardized human evaluations across different generation tasks, instantiated with datasets representing four core challenges in text generation: machine translation, summarization, commonsense reasoning, and machine comprehension. The work also develops an automated mechanism for maintaining annotator quality via a probabilistic model.

CrossFit: A Few-shot Learning Challenge for Cross-task Generalization in NLP

This paper presents the NLP Few-shot Gym, a repository of 160 diverse few-shot NLP tasks created from open-access NLP datasets and converted to a unified text-to-text format, and reveals that the few-shot learning ability on unseen tasks can be improved via an upstream learning stage using a set of seen tasks.

CoreQuisite: Circumstantial Preconditions of Common Sense Knowledge

A dataset called CoreQuisite is presented, which annotates commonsense facts with preconditions expressed in natural language, and it is shown that there is a 10-30% gap between machine and human performance on these tasks.

Discord Questions: A Computational Approach To Diversity Analysis in News Coverage

A new framework to assist readers in identifying source differences and gaining an understanding of news coverage diversity is proposed, based on the generation of Discord Questions: questions with a diverse answer pool, explicitly illustrating source differences.

Benchmarking Answer Verification Methods for Question Answering-Based Summarization Evaluation Metrics

This work benchmarks the lexical answer verification methods used by current QA-based metrics as well as two more sophisticated text comparison methods, BERTScore and LERC, and finds that LERC outperforms the other methods in some settings while remaining statistically indistinguishable from lexical overlap in others.

"Covid vaccine is against Covid but Oxford vaccine is made at Oxford!" Semantic Interpretation of Proper Noun Compounds

This work presents a new manually annotated dataset, PRONCI, consisting of 22.5K proper noun compounds along with their free-form semantic interpretations, and finds that adding targeted knowledge, particularly about the common noun, results in performance gains of up to 2.8%.

QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization

This work proposes an optimized metric, called QAFactEval, that leads to a 14% average improvement over previous QA-based metrics on the SummaC factual consistency benchmark, and also outperforms the best-performing entailment-based metric.

Evaluation of Review Summaries via Question-Answering

RunQA, Review Summary Evaluation via Question Answering, correlates well with human judgments in terms of coverage and focus of information and it is shown that the proposed approach is more robust than metrics in the literature for ranking summaries.



DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

This work introduces a new reading comprehension benchmark, DROP, which requires Discrete Reasoning Over the content of Paragraphs, and presents a new model that combines reading comprehension methods with simple numerical reasoning to achieve 51% F1.

Looking Beyond the Surface: A Challenge Set for Reading Comprehension over Multiple Sentences

The dataset is the first to study multi-sentence inference at scale, with an open-ended set of question types that require reasoning skills; human solvers achieve an F1-score of 88.1%.

Adversarial Examples for Evaluating Reading Comprehension Systems

This work proposes an adversarial evaluation scheme for the Stanford Question Answering Dataset that tests whether systems can answer questions about paragraphs that contain adversarially inserted sentences without changing the correct answer or misleading humans.

The NarrativeQA Reading Comprehension Challenge

A new dataset and set of tasks in which the reader must answer questions about stories by reading entire books or movie scripts are presented, designed so that successfully answering their questions requires understanding the underlying narrative rather than relying on shallow pattern matching or salience.

SQuAD: 100,000+ Questions for Machine Comprehension of Text

A strong logistic regression model is built, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%).

Cosmos QA: Machine Reading Comprehension with Contextual Commonsense Reasoning

This paper introduces Cosmos QA, a large-scale dataset of 35,600 problems that require commonsense-based reading comprehension, formulated as multiple-choice questions, and proposes a new architecture that improves over the competitive baselines.

Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning

This work presents a new crowdsourced dataset containing more than 24K span-selection questions that require resolving coreference among entities in over 4.7K English paragraphs from Wikipedia, and shows that state-of-the-art reading comprehension models perform significantly worse than humans on this benchmark.

MCTest: A Challenge Dataset for the Open-Domain Machine Comprehension of Text

MCTest is presented, a freely available set of stories and associated questions intended for research on the machine comprehension of text that requires machines to answer multiple-choice reading comprehension questions about fictional stories, directly tackling the high-level goal of open-domain machine comprehension.

RACE: Large-scale ReAding Comprehension Dataset From Examinations

The proportion of questions that require reasoning is much larger in RACE than in other benchmark datasets for reading comprehension, and there is a significant gap between the performance of state-of-the-art models and the ceiling human performance.

Evaluating Question Answering Evaluation

This work studies the suitability of existing metrics in QA and explores using BERTScore, a recently proposed metric for evaluating translation, for QA, finding that although it fails to provide stronger correlation with human judgements, future work focused on tailoring a BERT-based metric to QA evaluation may prove fruitful.