QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization

Alexander R. Fabbri, Chien-Sheng Wu, Wenhao Liu, and Caiming Xiong. In Proceedings of the North American Chapter of the Association for Computational Linguistics (NAACL).
Factual consistency is an essential quality of text summarization models in practical settings. Existing work in evaluating this dimension can be broadly categorized into two lines of research, entailment-based and question answering (QA)-based metrics, and different experimental setups often lead to contrasting conclusions as to which paradigm performs the best. In this work, we conduct an extensive comparison of entailment and QA-based metrics, demonstrating that carefully choosing the… 

SWING: Balancing Coverage and Faithfulness for Dialogue Summarization

The correlation between commonly used automatic metrics with human judgments in terms of three different dimensions regarding coverage and factual consistency is computed to provide insight into the most suitable metric for evaluating dialogue summaries.

WeCheck: Strong Factual Consistency Checker via Weakly Supervised Learning

A weakly supervised framework that aggregates multiple resources to train a precise and efficient factual metric, namely WeCheck, is proposed, which achieves a 5% relative improvement over previous state-of-the-art methods on the TRUE benchmark on average.

RQUGE: Reference-Free Metric for Evaluating Question Generation by Answering the Question

It is shown that RQUGE has a higher correlation with human judgment without relying on the reference question, and can improve the performance of QA models on out-of-domain datasets by tuning on synthetic data generated by a question generation model and re-ranked by RQUGE.

Shortcomings of Question Answering Based Factuality Frameworks for Error Localization

This paper conducts the first such analysis and shows that, contrary to expectations, QA-based frameworks fail to correctly identify error spans in generated summaries and are outperformed by trivial exact match baselines.

CaPE: Contrastive Parameter Ensembling for Reducing Hallucination in Abstractive Summarization

CaPE improves performance across different automatic factual metrics and human evaluation, with maximum improvements of 16.69% and 15.78% on summary-level dependency-arc entailment accuracy for the XSUM and CNN/DM datasets.
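
The parameter-ensembling idea can be sketched as simple weight arithmetic: combine a base model with an expert (trained on cleaner data) and an anti-expert (trained on noisier data). The additive form and the `alpha` step size below are illustrative assumptions, not the paper's exact scheme, and toy 1-D arrays stand in for real model weight tensors:

```python
import numpy as np

def contrastive_ensemble(base, expert, anti_expert, alpha=0.5):
    """Combine model parameters contrastively (illustrative sketch):
    move the base weights toward the expert and away from the
    anti-expert by a step of size alpha."""
    return {
        name: base[name] + alpha * (expert[name] - anti_expert[name])
        for name in base
    }

# Toy 1-D "parameters" standing in for real model weight tensors.
base = {"w": np.array([1.0, 1.0])}
expert = {"w": np.array([2.0, 0.0])}
anti = {"w": np.array([0.0, 2.0])}

combined = contrastive_ensemble(base, expert, anti, alpha=0.5)
print(combined["w"])  # [2. 0.]
```

The combined weights move toward the expert's direction and away from the anti-expert's, which is the contrastive intuition behind the method.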

Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation

A modified summarization salience protocol, Atomic Content Units (ACUs), which relies on fine-grained semantic units and allows for high inter-annotator agreement, is proposed; this has important implications for evaluating large language models (LLMs), as it shows that LLMs adjusted with human feedback may overfit unconstrained human evaluation.

On Improving Summarization Factual Consistency from Natural Language Feedback

This work collects a high-quality dataset, DeFacto, containing human demonstrations and informational feedback in natural language consisting of corrective instructions, edited summaries, and explanations with respect to the factual consistency of the summary, and evaluates if models can automatically correct factual inconsistencies in generated summaries.

Just ClozE! A Fast and Simple Method for Evaluating the Factual Consistency in Abstractive Summarization

This paper demonstrates that ClozE can reduce the evaluation time by nearly 96% relative to QA-based metrics while retaining their interpretability and performance through experiments on six human-annotated datasets and a meta-evaluation benchmark GO FIGURE (Gabriel et al., 2020).
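
The cloze construction at the heart of such metrics can be sketched as masking candidate factual spans in a summary and checking whether a model can fill them back in from the source. The capitalized-token heuristic below is a stand-in assumption for the NER or noun-phrase span selection a real implementation would use:

```python
def make_cloze_questions(summary_sentence, mask_token="[MASK]"):
    """Create cloze-style questions by masking candidate factual spans.
    A naive heuristic (capitalized, non-sentence-initial tokens) stands
    in for the entity extraction a real implementation would use."""
    tokens = summary_sentence.split()
    questions = []
    for i, tok in enumerate(tokens):
        if tok[:1].isupper() and i > 0:  # skip the sentence-initial capital
            masked = tokens[:i] + [mask_token] + tokens[i + 1:]
            questions.append((" ".join(masked), tok))
    return questions

qs = make_cloze_questions("The deal was signed by Apple in California .")
for cloze, answer in qs:
    print(answer, "->", cloze)
```

Each (cloze, answer) pair is then scored by answering the cloze against the source document and comparing the fill-in with the masked span.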

News Summarization and Evaluation in the Era of GPT-3

It is shown that not only do humans overwhelmingly prefer GPT-3 summaries, but these also do not suffer from common dataset-specific issues such as poor factuality, and both reference-based and reference-free automatic metrics cannot reliably evaluate zero-shot summaries.

Faithfulness in Natural Language Generation: A Systematic Survey of Analysis, Evaluation and Optimization Methods

This survey provides a systematic overview of the research progress on the faithfulness problem of NLG, including problem analysis, evaluation metrics, and optimization methods, and organizes the evaluation and optimization methods for different tasks into a unified taxonomy to facilitate comparison and learning across tasks.

Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary

This work proposes a metric to evaluate the content quality of a summary using question-answering (QA), and identifies its performance bottlenecks and estimates that its potential upper-bound performance surpasses all other automatic metrics, approaching that of the gold-standard Pyramid Method.
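
The answer-comparison step inside QA-based metrics like this one is commonly a token-overlap F1, as in SQuAD evaluation; a minimal sketch (whitespace tokenization is a simplifying assumption):

```python
from collections import Counter

def token_f1(prediction, gold):
    """Token-overlap F1 between a predicted and a gold answer, the
    comparison function commonly used inside QA-based summary metrics."""
    pred_toks = prediction.lower().split()
    gold_toks = gold.lower().split()
    common = Counter(pred_toks) & Counter(gold_toks)  # multiset intersection
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_toks)
    recall = overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

print(token_f1("the red car", "a red car"))  # 0.666...
```

The metric then averages such scores over all questions generated from the summary (or reference), so the answer comparison is where much of the metric's sensitivity comes from.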

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

The contextual representations learned by the proposed replaced token detection pre-training task substantially outperform the ones learned by methods such as BERT and XLNet given the same model size, data, and compute.
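
The replaced-token-detection objective can be illustrated by how its training examples are built: corrupt some input tokens and label every position as original or replaced. Uniform random substitution below is a simplifying assumption; ELECTRA samples plausible replacements from a small generator network:

```python
import random

def make_rtd_example(tokens, vocab, replace_prob=0.3, seed=0):
    """Build a replaced-token-detection training example: corrupt some
    tokens with substitutes (here drawn uniformly from a toy vocab) and
    label each position as original (0) or replaced (1)."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for tok in tokens:
        if rng.random() < replace_prob:
            corrupted.append(rng.choice(vocab))
            labels.append(1)
        else:
            corrupted.append(tok)
            labels.append(0)
    return corrupted, labels

tokens = "the cat sat on the mat".split()
corrupted, labels = make_rtd_example(tokens, vocab=["dog", "ran", "hat"])
print(list(zip(corrupted, labels)))
```

The discriminator is trained on every position (not just the masked 15% as in BERT), which is where the sample-efficiency gain comes from.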

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

SQuAD: 100,000+ Questions for Machine Comprehension of Text

A strong logistic regression model is built, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%).
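
SQuAD-style evaluation compares answers after a light normalization (lowercasing, stripping punctuation and the articles a/an/the, collapsing whitespace); a sketch of exact match under that normalization:

```python
import re
import string

def normalize_answer(s):
    """SQuAD-style answer normalization: lowercase, strip punctuation
    and the articles a/an/the, and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, gold):
    return normalize_answer(prediction) == normalize_answer(gold)

print(exact_match("The Eiffel Tower!", "eiffel tower"))  # True
```

The reported F1 is computed over tokens after the same normalization, so surface variation in articles and punctuation does not penalize a correct answer.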

SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization

This work revisits the use of NLI for inconsistency detection, finding that past work suffered from a mismatch in input granularity between NLI datasets (sentence-level), and inconsistency detection (document level).
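
Bridging that granularity gap means scoring every document-sentence/summary-sentence pair with NLI and then aggregating. A sketch of SummaC's zero-shot aggregation (max over document sentences, mean over summary sentences), with a stub matrix standing in for real NLI scores:

```python
import numpy as np

def summac_zs_score(entail_matrix):
    """Aggregate a sentence-pair entailment matrix the way SummaC-ZS does:
    for each summary sentence take the max entailment over all document
    sentences, then average over summary sentences.
    entail_matrix[i, j] = P(doc sentence i entails summary sentence j)."""
    per_summary_sentence = entail_matrix.max(axis=0)
    return float(per_summary_sentence.mean())

# Stub scores for a 3-sentence document and a 2-sentence summary;
# a real system would fill this matrix with an NLI model.
m = np.array([[0.9, 0.1],
              [0.2, 0.1],
              [0.1, 0.8]])
print(summac_zs_score(m))  # ≈ 0.85
```

Taking the max means a summary sentence only needs support from one document sentence; the paper's trained variant (SummaC-Conv) replaces the max with a learned convolution over the score distribution.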

On Faithfulness and Factuality in Abstractive Summarization

It is found that neural abstractive summarization models are highly prone to hallucinate content that is unfaithful to the input document and textual entailment measures better correlate with faithfulness than standard metrics, potentially leading the way to automatic evaluation metrics as well as training and decoding criteria.

Neural Text Summarization: A Critical Evaluation

This work critically evaluates key ingredients of the current research setup (datasets, evaluation metrics, and models) and highlights three primary shortcomings, beginning with the observation that automatically collected datasets leave the task underconstrained and may contain noise detrimental to training and evaluation.

Ranking Generated Summaries by Correctness: An Interesting but Challenging Application for Natural Language Inference

This paper evaluates summaries produced by state-of-the-art models via crowdsourcing and shows that such errors occur frequently, in particular with more abstractive models, which leads to an interesting downstream application for entailment models.

Single-Dataset Experts for Multi-Dataset QA

Published in Empirical Methods in Natural Language Processing (EMNLP), 2021.

Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics

A typology of factual errors is devised and used to collect human annotations of generated summaries from state-of-the-art summarization systems for the CNN/DM and XSum datasets, showing their correlation with human judgement as well as their specific strengths and weaknesses.