QAFactEval: Improved QA-Based Factual Consistency Evaluation for Summarization

Alexander R. Fabbri, Chien-Sheng Wu, Wenhao Liu, Caiming Xiong. North American Chapter of the Association for Computational Linguistics (NAACL).
Factual consistency is an essential quality of text summarization models in practical settings. Existing work in evaluating this dimension can be broadly categorized into two lines of research, entailment-based and question answering (QA)-based metrics, and different experimental setups often lead to contrasting conclusions as to which paradigm performs best. In this work, we conduct an extensive comparison of entailment and QA-based metrics, demonstrating that carefully choosing the…
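The QA-based paradigm mentioned above works by generating questions from the summary, answering them against the source document, and comparing the answers. A minimal sketch of that pipeline, where `answer_from_source` stands in for a trained QA model (here just a stub) and answer comparison uses standard token-overlap F1:

```python
from collections import Counter

def token_f1(pred: str, gold: str) -> float:
    """Token-overlap F1 between two answer strings (SQuAD-style)."""
    p, g = pred.lower().split(), gold.lower().split()
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(p)
    recall = overlap / len(g)
    return 2 * precision * recall / (precision + recall)

def qa_consistency_score(qa_pairs, answer_from_source) -> float:
    """Average answer-overlap F1 over (question, summary_answer) pairs.

    `qa_pairs` are questions generated from the summary together with the
    answers the summary supports; `answer_from_source` answers a question
    against the source document. Both components are hypothetical stand-ins
    for the trained QG and QA models a real metric would use.
    """
    scores = [token_f1(answer_from_source(q), a) for q, a in qa_pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

A summary is scored as consistent to the extent that the source document yields the same answers the summary does; real metrics like QAFactEval additionally learn the answer-comparison component rather than relying on lexical F1 alone.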

RQUGE: Reference-Free Metric for Evaluating Question Generation by Answering the Question

It is shown that RQUGE has a higher correlation with human judgment without relying on the reference question, and can improve the performance of QA models on out-of-domain datasets by tuning on synthetic data generated by a question generation model and re-ranked by RQUGE.

Shortcomings of Question Answering Based Factuality Frameworks for Error Localization

This paper conducts the first such analysis and shows that, contrary to expectations, QA-based frameworks fail to correctly identify error spans in generated summaries and are outperformed by trivial exact match baselines.

CaPE: Contrastive Parameter Ensembling for Reducing Hallucination in Abstractive Summarization

CaPE improves performance across different automatic factual metrics and human evaluation, with maximum improvements of 16.69% and 15.78% in summary-level dependency-arc entailment accuracy on the XSUM and CNN/DM datasets, respectively.

Just ClozE! A Fast and Simple Method for Evaluating the Factual Consistency in Abstractive Summarization

This paper demonstrates that ClozE can reduce the evaluation time by nearly 96% relative to QA-based metrics while retaining their interpretability and performance, through experiments on six human-annotated datasets and the meta-evaluation benchmark GO FIGURE (Gabriel et al., 2020).

News Summarization and Evaluation in the Era of GPT-3

It is shown that not only do humans overwhelmingly prefer GPT-3 summaries, but these also do not suffer from common dataset-specific issues such as poor factuality, and both reference-based and reference-free automatic metrics cannot reliably evaluate zero-shot summaries.

Faithfulness in Natural Language Generation: A Systematic Survey of Analysis, Evaluation and Optimization Methods

This survey provides a systematic overview of the research progress on the faithfulness problem of NLG, including problem analysis, evaluation metrics and optimization methods, and organizes the evaluation and optimization methods for different tasks into a unified taxonomy to facilitate comparison and learning across tasks.

Consistency and Coherence from Points of Contextual Similarity

This work generalizes the ESTIME measure, making it applicable to any text-summary pairs, and observes that useful information exists in almost all of the layers except the several lowest ones; for consistency and fluency, qualities focused on local text details, the most useful layers are close to the top.

Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors

These findings show that benchmarks built on modern summary outputs (those from pre-trained models) yield significantly different results than benchmarks using pre-Transformer models, suggesting that practitioners should take care to choose the right error detector for the systems at hand.

Best-$k$ Search Algorithm for Neural Text Generation

Experiments on four NLG tasks show that best-k search yields more diverse and natural outputs than strong baselines while maintaining high text quality.
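The core idea of best-k search is a best-first search that pops the k highest-scoring frontier nodes per step instead of one. A toy sketch of that scheduling scheme (the paper's full algorithm also adds a temporal-decay term to the score, which is omitted here):

```python
import heapq

def best_k_search(start, expand, is_complete, k=2, max_steps=100):
    """Best-first search popping the k best frontier nodes per step.

    `expand(node)` yields (child, cumulative_score) pairs, higher = better;
    `is_complete(node)` marks finished hypotheses. Returns finished
    hypotheses sorted best-first. Both callbacks are caller-supplied;
    in text generation they would wrap a language model.
    """
    # heapq is a min-heap, so scores are stored negated.
    frontier = [(0.0, start)]
    finished = []
    for _ in range(max_steps):
        if not frontier:
            break
        popped = [heapq.heappop(frontier) for _ in range(min(k, len(frontier)))]
        for neg_score, node in popped:
            if is_complete(node):
                finished.append((-neg_score, node))
                continue
            for child, score in expand(node):
                heapq.heappush(frontier, (-score, child))
    return sorted(finished, reverse=True)
```

For example, with a toy character "language model" assigning log-probability -0.1 to `a` and -0.5 to `b`, searching for length-2 strings returns `aa` as the top hypothesis while still exploring the lower-scoring branches, which is where the extra diversity over greedy/beam decoding comes from.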

Improving Factual Consistency in Summarization with Compression-Based Post-Editing

This work proposes to use sentence-compression data to train the post-editing model to take a summary with extrinsic entity errors marked with special tokens and output a compressed, well-formed summary with those errors removed, and shows that this model improves factual consistency while maintaining ROUGE.

Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary

This work proposes a metric to evaluate the content quality of a summary using question-answering (QA), and identifies its performance bottlenecks and estimates that its potential upper-bound performance surpasses all other automatic metrics, approaching that of the gold-standard Pyramid Method.

ELECTRA: Pre-training Text Encoders as Discriminators Rather Than Generators

The contextual representations learned by the proposed replaced token detection pre-training task substantially outperform the ones learned by methods such as BERT and XLNet given the same model size, data, and compute.
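Replaced token detection trains a discriminator over every input position: a small generator corrupts some tokens, and the model predicts which tokens were replaced. The target construction is simple to sketch (the generator itself is stubbed out here as an already-corrupted sequence):

```python
def rtd_labels(original, corrupted):
    """Replaced-token-detection targets: 1 where the corrupted token
    differs from the original, else 0. Unlike masked language modeling,
    the loss is defined over all positions, not just the masked subset,
    which is a key source of ELECTRA's sample efficiency.
    """
    return [int(o != c) for o, c in zip(original, corrupted)]
```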

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

A new language representation model, BERT, designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers, which can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks.

SQuAD: 100,000+ Questions for Machine Comprehension of Text

A strong logistic regression model is built, which achieves an F1 score of 51.0%, a significant improvement over a simple baseline (20%).

SummaC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization

This work revisits the use of NLI for inconsistency detection, finding that past work suffered from a mismatch in input granularity between NLI datasets (sentence-level) and inconsistency detection (document-level).
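SummaC bridges that granularity gap by scoring every document-sentence/summary-sentence pair with an NLI model and aggregating. A minimal sketch of the zero-shot aggregation (max over document sentences, mean over summary sentences), with the NLI model replaced by a caller-supplied stub:

```python
def summac_zs(doc_sents, sum_sents, entail_score):
    """SummaC-ZS-style aggregation: for each summary sentence take the max
    entailment score over document sentences, then average over summary
    sentences. `entail_score(premise, hypothesis)` stands in for an NLI
    model's entailment probability; any real implementation would use a
    trained NLI model here.
    """
    per_sent = [max(entail_score(d, s) for d in doc_sents) for s in sum_sents]
    return sum(per_sent) / len(per_sent)
```

The max picks out the single document sentence that best supports each summary sentence, so one unsupported sentence drags down the document-level score without being washed out by the rest.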

On Faithfulness and Factuality in Abstractive Summarization

It is found that neural abstractive summarization models are highly prone to hallucinate content that is unfaithful to the input document and textual entailment measures better correlate with faithfulness than standard metrics, potentially leading the way to automatic evaluation metrics as well as training and decoding criteria.

Neural Text Summarization: A Critical Evaluation

This work critically evaluates key ingredients of the current research setup: datasets, evaluation metrics, and models, and highlights three primary shortcomings: automatically collected datasets leave the task underconstrained and may contain noise detrimental to training and evaluation.

Ranking Generated Summaries by Correctness: An Interesting but Challenging Application for Natural Language Inference

This paper evaluates summaries produced by state-of-the-art models via crowdsourcing and shows that such errors occur frequently, in particular with more abstractive models, which leads to an interesting downstream application for entailment models.

Single-Dataset Experts for Multi-Dataset QA

Empirical Methods in Natural Language Processing (EMNLP), 2021.

Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics

A typology of factual errors is devised and used to collect human annotations of generated summaries from state-of-the-art summarization systems for the CNN/DM and XSum datasets, showing their correlation with human judgement as well as their specific strengths and weaknesses.